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The  early  sections  of  this  report  are  introductory  In  nature. A brief 
description  of  the  DBC  and  a Bumniary  of  the  l]q>ortant  aspects  of  System  R have 
been  presented.  They  have  been  Included  so  that  a reader  without  a detailed 
imowledge  of  a relational  system  or  the  DBC  may  follow  the  rest  of  the  material 
wltl^etif  undue  difficulty. 

'^The  representation  of  relational  tuples  in  the  DBC  is  quite  straightforward. 
The  data  items  of  every  tuple  are  converted  to  attribute-value  pairs  to  form  a 
single  DBC  record.  Two  special  attribute-value  pairs  are  also  included  in  each 
DBC  record  in  order  to  indicate  the  relation  to  which  the  corresponding  tuple 
belongs ( and  to  provide  certain  clustering  information. 

User  transactions  in  the  data  sublanguage,  called  SEQUEL,  are  converted  to 
a series  of  DBC  commands.  The  commands  are  so  structured  that  the  DBC  can 
simultaneously  access  a number  of  records,  the  contents  of  which  satisfy  the 
predicates  in  a SEQUEL  query.  Given  a particular  command,  the  DBC  uses  its 
directory  to  determine  the  portions  of  its  secondary  storage  that  need  to  be 
content-searched.  The  report  further  demonstrates  how  view  mechanism, 
authorization,  ln»^rlty  assertions  and  triggers  may  be  supported. 

The  report  is  doncluded  with  an  analysis  of  the  memory  requirements  and 
query  execution  times  in  two  different  cases:  (1)  when  a conventional  conqiuter 
system  is  used  to  Impl'^ment  a relational  database  management  system  and  (2)  when 
the  DBC  is  used  in  conjwctlon  with  a front-end  computer  to  do  the  same  job. 

It  has  been  observed  that  the  mass  memory  requirement  of  the  conventional 
system  is  0.5  to  1.0  tlmds  that  of  the  DBC,  but  the  directory  memory  requirement 
is  one  or  more  orders  of  magnitude  greater  than  that  of  the  DBC.  Under  usual 
circumstances,  the  query  execution  time  of  the  DBC  is  also  faster  by  ten  to 
hundred  times,  and  sometimes  more. 
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1.  INTRODUCTION 


This  Is  Che  last  of  a series  of  three  reports  dealing  with  certain 
software  aspects  of  a database  computer,  known  as  the  DBC.  More  specifically, 
the  main  theme  underlying  the  studies  Is  the  demonstration  of  the  fact  that 
a database  machine  like  the  DBC  is  Indeed  capable  of  supporting  the  common 
data  models  at  a cost  which  Is  considerably  less  than  what  would  be  Incurred 
on  a conventional  general-purpose  computer.  Moreover,  certain  additional 
associative  retrieval  and  access  control  features  can  be  Incorporated  Into 
a database  management  system  without  any  significant  overhead  If  the  DBC 
Is  used  to  support  the  system.  In  the  first  report  [1]  of  this  series, 
the  DBC  software  requirements  have  been  presented  for  handling  hierarchical 
databases.  In  the  second  report  [2],  the  DBC  capabilities  were  studied  with 
a view  towards  the  support  of  network  (e.g.,  CODASYL)  databases.  Ve  shall 
now  conclude  this  series  by  directing  our  Investigation  on  the  software 
requirements  for  relational  databases. 

The  relational  model  of  data,  as  Introduced  by  Codd  [3}  In  1970,  Is  an 
approach  towards  providing  a data  model  or  view  which  Is  divorced  from 
various  Implementation  considerations  as  well  as  providing  the  database 
user  with  a high-level,  set-at-a-tlme  (rather  than  record-at-a-tlme)  data 
sublanguage.  Currently  there  exists  no  commercial  Implementation  of  any 
system  based  on  the  relational  data  model.  The  lack  of  commercial  adaption 
Is  not  due  to  any  Inherent  Inadequacy  of  the  relational  model,  since  It  Is  a 
very  simple  and  elegant  data  model  [4].  The  reluctance  among  the  commercial 
organizations  to  accept  and  Implement  relational  database  systems  can, 
perhaps,  be  attributed  to  the  fact  that  other  data  models  are  available.  It 
will  require  a major  effort  from  the  organizations  and  the  users  to  adapt 
a new  system  such  as  the  relational  one.  There  are,  however,  two  Important 
attempts  at  designing  and  Inq) lamenting  experimental  prototype  relational 
database  management  systems.  One  Is  called  System  R [5]  designed  and 
Implemented  at  the  IBM  San  Jose  Research  Laboratory  and  the  other  Is 
Ingres  [6]  developed  at  the  University  of  California,  Berkeley.  In  many 
respects,  the  two  systems  are  quite  similar.  We  shall  therefore  restrict 
our  attention  to  only  one  of  them,  namely.  System  R.  In  other  words,  for 
the  sake  of  specificity,  we  shall  assume,  and  perhaps  quite  justifiably, 
that  System  R does  possess  most  of  the  Important  features  expected  In  a 
relational  database  management  system. 
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Database  computers  are  a recent  addition  to  the  family  of  coiq>uter8. 

With  the  advent  of  large  databases,  there  has  been  a growing  awareness  of 
the  necessity  of  a computer  architecture  that  Is  oriented  towards  storage, 
retrieval  and  manipulation  of  large  quantities  of  Information.  The  DBC 
[7,8,9]  Is  a step  In  that  direction.  It  utilizes  content-addressable 
memories  and  processors  with  various  speeds  and  capacities.  In  addition  It 
provides  powerful  clustering  mechanisms  for  performance  enhancement  and  security 
mechanisms  for  access  control.  The  bullt-ln  hardware  data  structure  enables  the 
DBC  to  Interface  directly  with  existing  database  management  application  programs 
with  minimal  software.  In  other  words.  It  Is  the  purpose  of  this  report  to  show 
that  the  required  software  Is  minimal  and  that  the  new  software  can  replace 
existing  database  management  systems  with  Improved  performance. 

This  report  Is  organized  as  follows.  Sections  2 and  3 are  Introductions 
to  the  DBC  and  the  relational  database  management  systems,  respectively. 

In  Section  4,  we  demonstrate  how  relational  data  Is  stored  In  the  DBC.  In 
Section  5,  we  discuss  how  commands  In  the  relational  data  sublanguage  are 
translated  Into  DBC  queries.  We  propose  a methodology  In  Section  6 to 
support  relational  views  and  Integrity  features.  We  conclude  In  Sections 
7 and  8 with  an  analysis  of  DBC  performance  as  compared  to  the  performance 
of  a conventional  computer  In  managing  relational  databases. 


2.  THE  DATABASE  COMPUTER  (DBC) 


As  a special-purpose  computer,  the  DBC  Is  Intended  to  be  used  as  a 
back-end  machine  to  a front-end  conventional  computer.  It  Is  designed  to 
handle  very  large  databases  of  10^  to  10^®  bytes  In  an  efficient  manner. 

In  this  section,  we  shall  concentrate  on  the  major  architectural  features 
of  the  DBC. 

2.1  The  DBC  Data  Model 

Let  there  be  two  primitive  sets:  a set  AT  of  "attributes"  and  a set 
VA  of  "values".  The  meaning  of  the  two  sets  Is  assiuned  to  be  understood 
and  Is  left  otherwise  undefined  In  order  to  allow  for  the  broadest  possible 
Interpretation.  A record  R Is  a subset  of  the  Cartesian  product  AT  x VA, 
with  the  restriction  that  every  attribute  In  a record  Is  distinct.  Thus,  R 
Is  a set  of  ordered  pairs  of  the  form: 

<an  attribute,  a value>  . 

The  keywords  of  a record  (or  a group  of  records)  are  those  attribute-value 
pairs  which  characterize  the  record  (or  the  record  group),  l.e.,  those  pairs 
that  may  be  used  to  distinguish  the  record  (or  the  record  group)  from  all 
others.  The  other  attribute-value  pairs  of  a record.  If  any,  are  collectively 
called  the  record-body. 

The  set  of  all  records  which  are  stored  In  the  DBC  Is  called  the  database. 
The  database  may  be  partitioned  Into  subsets  called  files,  each  with  Its 
unique  file-name. 

A.  Query 

A keyword  predicate  Is  a triple  of  the  form: 

<attrlbute,  relational  operator,  value  >. 

A relational  operator  Is  an  element  of  the  set  {-,^,<,S,>,>  | . A keyword 
<A,V>  Is  said  to  satisfy  a keyword  predicate  <Ap,0p,Vp>  If  and  only  If  A“Ap 
and  V Op  Vp,  l.e.,  V and  Vp  are  related  by  the  operator  Op.  A query  Is  a 
Boolean  expression  of  keyword  predicates  In  dlsjimctlve  normal  form.  Thus, 
a query  Is  a dlsjimctlon  of  conjuncts  known  as  query  conjiincts,  where  a 
query  conjunct  Is  simply  a conjtinctlon  of  keyword  predicates.  A record  In 
a Mle  satisfies  a query  If  It  satisfies  at  least  one  query  conjunct  In  the 
query.  The  set  of  all  records  In  a file  that  satisfy  a query  will  be  called 
the  response  set  of  the  query. 


As  an  example  of  the  types  of  queries  that  may  be  recognized  by  the 

DBC,  consider  the  following: 

([DEPT-' TOY ']6i[SALARY<10000])  V 
( [DEPT- ' BOOK ' ] & [ SALARY>50000 ] ) . 

If  the  above  query  refers  to  a file  of  employees  of  a department  store,  then 
It  will  be  satisfied  by  records  of  the  eiiq>loyees  working  either  In  the  toy 
department  and  earning  less  than  10,000,  or  working  'n  the  book  department 
and  making  more  than  30,000. 

Queries  are  used  not  only  to  retrieve  a set  of  records  among  all  the 
records  In  the  database  but  also  to  specify  protection  requirements  and 
clustering  conditions. 

B.  Security 

The  DBC  allows  for  security  specifications  based  on  the  actual  contents 
of  the  database.  A database  access  or  simply  an  access  Is  the  name  of  a 
DBC  operation  which  transfers  Information  to  or  from  the  database.  Examples 
of  accesses  are  retrieve.  Insert  and  delete.  For  every  user  of  the  database, 
the  DBC  maintains  a database  capability,  which  Is  simply  a list  of  file 
sanctions  whose  entries  are  of  the  form: 

(F,[Q1,A1],[Q2,A2] [Qn.An]) 

where  F Is  a file  name,  each  Q1  Is  a query  and  each  A1  Is  a set  ■>£  accesses. 

The  database  capability  of  a user  determines  the  records  he  can  access. 

For  example,  for  a user  to  be  allowed  to  perform  an  access  operation  ^ on 
record  R of  file  F,  the  following  condition  must  hold  for  every  (Q1,A1)  In 
the  file  sanction  for  F: 

If  (R  satisfies  Ql)  then  €A1)  , 

This  type  of  security  specification  Is  powerful  and  elegant.  With  this 
specification,  not  only  can  security  be  enforced  In  terms  of  record  types 
or  entire  files,  but  security  can  also  be  facilitated  at  a much  more  detailed 
level  based  on  the  actual  content  of  the  records  In  the  database.  And  since 
such  a mechanism  Is  directly  provided  In  the  DBC,  It  may  easily  be  Incorporated 
Into  any  database  management  system  supported  by  the  DBC.  A more  detailed  and 
formal  discussion  of  the  DBC  security  provisions  will  be  found  In  [7]. 

C.  Clustering 

Based  on  certain  prespecified  Information  created  by  the  tiser,  clustering 
of  records  Is  done  automatically  by  the  DBC,  so  that  records  being  accessed 


together  are  stored  close  to  one  another.  This  Is  necessary  since  the  DBC 
Is  not  designed  to  be  fully  associative.  The  user  Is  provided  some  degree 
of  control  over  the  placement  of  records  by  application  of  the  concept  of 
clustering  keyword.  Certain  attributes  of  a file  may  be  designated  as 
clustering  attributes.  Keywords  whose  attributes  are  clustering  attributes 
are  termed  clustering  keywords , A cliister  Is  then  defined  as  a set  of 
records  all  of  which  have  the  same  set  of  clustering  keywrods.  Each  record 
In  the  file  will  then  belong  to  one  and  only  cluster.  The  user  may  now 
Impose  weighted  clustering  conditions  on  the  records.  A clustering  condition 
Is  a query  formed  with  clustering  keywords.  The  user,  when  Inserting  a 
record  In  the  database,  specifies  certain  clustering  conditions  and  their 
associated  weights.  A sum-of-welght  corresponding  to  the  above  clustering 
conditions  can  be  calculated  for  any  record  In  the  database  by  adding  the 
weights  of  those  clustering  conditions  that  are  satisfied  by  that  record. 

The  new  record  (the  record  to  be  Inserted)  is  then  placed  In  the  database 
close  to  an  existing  record  with  the  largest  sum-of-welght.  The  clustering 
process  does  not  really  require  the  Inspection  of  the  database,  as  demonstrated 
In  [9],  since  directories  are  maintained. 

2.2  DBC  Architecture 

The  most  natural  way  of  addressing  Information  In  a database  Is  In  terms 
of  the  content  of  the  records.  However,  the  secondary  storages  of  conventional 
computers  have  so  far  been  limited  only  to  location-addressability.  This 
Implies  that  In  order  to  find  a record  In  the  database,  the  location  of  the 
record  must  first  be  determined  via  software  techniques  and  auxiliary  data 
structures.  The  overhead,  therefore.  Includes  the  con5>lexity  of  software  to 
support  auxiliary  data  structures.  This  overhead  becomes  particularly 
Intolerable  when  the  database  Is  large,  since  the  search  of  the  auxiliary 
structure  Itself  becomes  a time-consuming  process. 

The  DBC  provides  for  the  entire  database  an  on-line  storage  which  can 
be  content-addressed.  Although  associative  memory  also  provides  content- 
addressing, It  Is  not  possible  to  develop  a monolithic  associative  memory 
with  sufficient  capacity  for  DBC  storage.  By  partitioning  the  memory  Into 
blocks,  each  of  which  Is  content-addressable,  and  by  limiting  access  to  only 
one  of  these  blocks  at  a time,  the  DBC  can  achieve  some  degree  of  associativity 
and  very  large  storage  capacity.  Such  a processor  and  memory  organization 
Is  termed  a partitioned  content-addressable  memory  (PCAM) . The  on-line 


mass  memory  (MM)  of  the  DBC  is  a FCAM.  Each  partition  of  the  MM  Is  called 
a minimal  access  unit  (MAU) . As  an  exaiiq>le,  a 10^  byte  database  will  have 
1,000  MAUs  each  of  which  processes  and  stores  10^  bytes,  which  Is  the 
approximate  size  of  a disk  cylinder. 

Another  major  component  of  the  DBC  Is  a processor  called  the  database 
command  and  control  processor  (DBCCP) . When  a command  from  the  front-end 
computer  (the  one  which  Interfaces  with  the  user)  Is  sent  to  the  DBC,  the 
DBCCP  decodes  the  command,  determines  the  MAUs  to  be  searched  In  order  to 
satisfy  the  command.  Issues  appropriate  orders  to  the  MM  and  transfers  data 
to/from  the  front-end  computer. 

Since  a large  database  will  c -ataln  many  MAUs  and  since  only  one  MAU 
can  be  accessed  at  a time.  It  Is  not  practical  to  search  all  the  MAUs  for 
each  search  order.  Hence,  directory  entries  are  made  for  certain  keywords. 
These  keywords  are  called  Type-D  keywords  or  directory  keywords.  A directory 
entry  consists  of  a Type-D  keyword  and  the  numbers  of  the  MAUs  in  which 
records  containing  this  keyword  appear.  Any  query  conjimct  is  expected  to 
have  at  least  one  predicate  consisting  of  a directory  keyword.  Otherwise, 
an  exhaustive  search  of  the  MM  will  be  necessary  to  satisfy  the  query.  In 
addition,  the  clustering  keywords  and  security  keywords  are  treated  as 
Instances  of  Type-D  keywords. 

The  collection  of  all  the  directory  entries  Is  also  stored  In  a PCAM 
with  a capacity  and  processing  speed  that  Is  different  from  the  mass  memory 
PCAM.  This  PCAM  is  known  as  the  structure  memory  (SM) . T5rplcally  directories 
are  of  the  order  of  1%  to  10%  of  the  database.  Therefore,  the  SM  has  a 
capacity  of  10^  to  10^  bytes.  It  Is  estimated  that  a query  conjunct  will 
seldom  have  more  than  20  predicates;  and  a single  MAU  access  will  normally 
satisfy  a query.  Therefore,  the  access  speed  of  the  SM  Is  about  1 millisecond 
which  Is  about  20  times  faster  than  the  time  required  to  access  an  MAU.  Thus, 
In  the  time  required  to  process  a query  in  the  SM,  another  query  may  be 
(Satisfied  by  accessing  an  MAU.  The  relationship  of  SM,  MM  and  DBCCP  is 
ueplcted  In  Figure  2.1. 

The  processors  associated  with  the  MM  have  the  capability  of  returning 
a group  of  records  (satisfying  a query)  In  a sorted  order,  say,  sorted  by 
a given  attribute.  They  can  also  carry  out  certain  set  functions.  In 
particular,  they  can  take  a group  of  records  and  determine  the  minimum, 
maximum,  sum  and  average  of  the  various  values  of  a given  attribute 
considering  all  the  records  In  the  group.  The  number  of  records  satisfying 
a query  can  also  be  counted  by  hardware.  Furthermore,  any  specific  combination 
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of  fields  of  a record  may  be  returned  (on  request)  to  the  user,  rather  than 
the  record  in  Its  entirety. 

2.3  DBC  Commands 


The  front-end  conqputer  communicates  with  the  DBC  by  Issuing  DBC  commands. 
Two  types  of  commands  are  recognized  by  the  DBC:  the  access  commands  and 
the  preparatory  commands.  Access  commands  are  used  to  retrieve.  Insert, 
delete  and  update  DBC  records  In  a file.  Preparatory  co™n«nds  are  Issued  to 
manage  file  Information  and  security  specifications  In  preparation  of 
subsequent  access  commands. 

Preparatory  commnads  consist  of  the  following.  There  Is  a command  to 
open  a database  file  for  access.  A specific  preparatory  command  Is  reserved 
for  Informing  the  DBC  the  Identity  of  the  users  who  have  the  right  to  create 
files.  Prior  to  the  creation  of  a file,  a command  may  be  sent  to  the  DBC 
to  open  the  particular  file  for  creation.  The  command  provides  Information 
on  the  number  of  attributes  the  file  Is  to  have,  the  number  of  MAUs  that 
need  to  be  allocated  Initially  and  the  number  of  MAUs  that  may  be  allocated 
If  the  Initial  allocation  Is  Insufficient.  Separate  commands  are  also  used 
for  specifying  the  attributes  for  a new  file  and  for  providing  Its  security 
descriptors. 

An  access  command  Is  sent  by  a general-purpose  computer  to  the  DBC  to 
perform  a specific  data  manipulation  function.  An  access  command  Is  recognized 
as  being  a retrieve,  delete.  Insert  or  replace  command.  A retrieve  command 
has  the  following  form: 

RETRIEVE  :<f lie  ldentlfler>  [WITH  POINTER] 

[[SORT  BY  <sort  attrlbute>]  / 

[set  function  8peclflcatlon>  [ONLY]]] 

[<fleld  specification  ll8t-l>]  <record  8peclflcatlon-l> 

[CONNECT  ON  <attrlbute-l,attrlbute-2> 

[<fleld  specification  llst-2>]  <record  speclflcatlon-2>] 
where  file  Identifier  refers  to  the  name  of  the  file  on  which  the  retrieval 
operation  Is  to  be  carried  out:  the  WITH  POINTER  clause  specifies  that  the 
response  data  must  be  accompanied  by  Implemntatlon-dependent  pointers;  the 
sort  attribute  specifies  the  attribute  according  to  whose  values  the  DBC 
must  sort  the  response  data  (i.e.,  the  records  or  fields);  the  set  function 
specification  may  be  one  of  COUNT,  AVG,  MAX,  MIN  or  SUM.  COUNT  returns  the 
nund>er  of  data  elements  retrieved,  AVG  coiq>utes  the  average  value (s)  of  the 
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valueCs)  for  each  of  the  field (s)  specified  In  the  field  specification, 
while  MIN  returns  the  minimum  value(s)  of  the  fleld(s)  specified.  The 
function  SUM  computes  the  sum  of  the  values  for  each  of  the  fields  specified. 

The  functions  AVG,  MIN,  MAX  and  SUM  operate  only  on  numeric  fields.  The 
ONLY  clause  Is  used  to  Indicate  that  the  value  of  the  set  function  need 
only  be  returned.  If  the  ONLY  clause  Is  omitted,  then  both  the  set  function 
value  and  the  Indlvldtial  field  values  will  be  returned.  The  field  specifica- 
tion llst-l  and  llst-2  specify  the  names  of  the  attributes  whose  values  are 
to  be  retrieved  as  the  response  data.  The  values  In  the  response  data  can 
be  made  unique  with  respect  to  any  one  field  by  associating  the  prefix 
UNIQUE  with  the  attribute  name  of  the  desired  field.  If  the  field  specification 
lists  are  not  given  In  the  command,  entire  records  will  be  retrieved.  The 
record  specification  Is  either  a query  (as  defined  earlier)  or  an  Inqilementatlon- 
dependent  record  pointer.  A record  retrieved  In  accordance  with  a record 
specification  or  a collection  of  fields  retrieved  In  accordance  with  a field 
specification  list  Is  called  a data  element.  The  CONNECT  ON  clause  specifies 
that  the  data  elements  retrieved  In  response  to  record  speclflcatlon-1  and 
record  speclflcatlon-2  must  be  joined  on  the  attributes  specified  as  arguments 
of  the  CONNECT  ON  clause.  Here  join  means  an  equality  join.  Attrlbute-1 
refers  to  the  connecting  attribute  In  the  data  elements  defined  by  the  field 
specification  llst-l,  while  attrlbute-2  refers  to  the  connecting  attribute 
In  the  data  elements  defined  by  the  field  specification  llst-2.  In  the  case 
when  the  set  function  specification,  the  UNIQUE  option  In  the  set  specification 
llsts^and  the  CONNECT  ON  clause  are  all  specified  In  a single  retrieve  command, 
the  order  of  precedence  Is  as  follows:  First,  the  fields  In  the  field 
specification  lists  (1  and  2)  are  extracted,  then  the  UNIQUE  option  Is 
executed,  next  the  CONNECT  ON  clause  Is  effected  and  finally  the  set  function 
Is  applied  to  the  result. 

The  general  form  of  a delete  command  Is  as  follws: 

DELETE:  <flle  ldentlfler>  [<record  specif lcatlon>] 
where  file  Identifier  Identifies  the  file  on  which  the  deletion  operation  Is 
to  take  place,  and  record  specification  Is  either  a query  or  a pointer.  If 
the  record  specification  Is  omitted,  then  the  entire  file  will  be  deleted. 

The  general  format  of  an  Insert  command  Is  as  follows: 

INSERT:  <record  to  be  lnserted>  [<clusterlng 
condltlons>] 

The  general  format  of  a replace  command  Is  as  follows: 

REPLACE:  <record  speclflcatlon> 

<keywords  for  replacement>  / <new  record  > 
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where  record  specification  is  either  a query  or  a record  pointer.  The 
record  specification  describes  the  record (s)  which  need  to  be  modified.  The 
keywords  for  replacement  are  attribute-value  pairs  which  will  replace 
corresponding  attribute-value  pairs  In  the  records  to  be  modified.  The  new 
record  Is  one  which  will  replace  the  entire  record (s)  defined  by  the  record 
specification.  Only  one  of  the  two  options  may  be  specified  In  a replace 
command.  That  Is,  either  existing  records  are  modified  with  respect  to 
certain  keywords  or  entire  records  are  replaced  by  a new  record. 

The  commands  we  have  discussed  thus  far  are  executed  directly  by  the 
DBG  hardware.  These  commands  deviate  from  conventional  machine  language 
commands  In  the  following  major  wayst  they  are  very  high-level,  they  use 
variable  length  formats  and  they  provide  set-at-a-tlme  access.  In  Sections 
5 and  6,  we  shall  have  occasion  to  use  these  extremely  powerful  commands  In 
executing  transactions  written  In  a relational  data  sublanguage. 
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3.  THE  RELATIONAL  DATA  MODEL 

The  relational  data  model  can  slnpllfy  both  the  conceptual  view  and 
the  user  view  of  a database.  All  relationships  or  connections  among  data 
Items  are  shown  in  the  form  of  mathematical  relations  over  a set  of  domains  [3]. 
The  data  Items  themselves  are  simply  the  values  associated  with  these  domains. 
Thus,  the  number  of  primitives  in  the  relational  model  Is  only  one,  namely, 
the  relation.  This  contributes  to  the  overall  sliiq>llclty  of  the  relational 
data  model. 

Conceptually,  a relation  Is  a table  In  which  each  column  corresponds 
to  a distinct  attribute  and  each  row  corresponds  to  a distinct  entity  or 
tuple.  Each  tuple  Is  distinct  In  the  sense  that  no  two  tuples  In  a relation 
have  Identical  values  for  all  attributes.  The  set  of  possible  values  that 
can  be  assumed  by  an  attribute  Is  called  the  domain  of  that  attribute.  Two 
different  attributes  of  a single  relation  can  have  the  same  underlying  domain. 
For  example,  the  attributes  QUANTITY  and  RUNS-SCORED  assume  values  from  the 
domain  of  natural  numbers.  Finally,  a relation  R Is  a subset  of  the  Cartesian 
product  of  the  domains  associated  with  the  relation's  attributes  Al,A2,...An. 
Such  a relation  Is  denoted  R(A1,A2 An). 

There  are  two  Important  ways  In  which  database  relations  differ  from 
mathematical  relations:  (1)  The  ordering  of  the  values  within  a tuple  of 
a database  relation  Is  Immaterial  If  the  attribute  names  accompany  the 
corresponding  values;  (2)  The  set  of  tuples  that  comprises  a database  relation 
will  normally  change  over  time  as  tuples  are  Inserted,  deleted  or  modified. 

3. 1 Normalized  Relations 

Vhlle  the  relational  model  can  Inherently  be  applicable  to  the  formulation 
of  any  type  of  relation.  It  Is  a conmon  practice  to  subject  the  relations  to 
a process  of  normalization.  The  primary  reason  for  doing  this  Is  to  eliminate 
the  possibility  of  certain  types  of  Inconsistencies  that  may  otherwise  arise 
during  the  update  (such  as,  modification,  deletion.  Insertion)  of  a tuple. 

A fine  treatment  of  the  normalization  process  Is  presented  In  [4]  as  well  as 
In  [10,11,12].  However,  since  normalization  Is  not  central  to  an  understanding 
of  the  manner  In  which  the  relational  data  model  Is  handled  by  the  DBC,  we  shall 
restrict  ourselves,  only  for  the  sake  of  completeness,  to  a brief  description 
of  the  various  normal  forms  In  the  appendix. 
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3,2  SEQUEL;  A Relational  Data  Sublanguage 

A comprehensive  database  management  system  (DBMS)  should  Include 
provisions  such  as  simple  but  flexible  user  views,  data  definition,  data 
manipulation  and  query  capabilities,  as  well  as  convenient  access  support, 
system  recovery  and  integrity  enforcement.  System  R [5]  is  one  such  system 
that  is  based  on  the  relational  data  model.  System  R provides  user  interface 
through  a data  sublanguage  called  SEQUEL  [13],  An  improved  version  of  SEQUEL, 
called  SEQUEL  2 [14],  is  currently  being  used  in  Sytem  R.  The  language 
features  used  in  this  document  are  those  of  SEQUEL  2,  even  though  we  shall 
refer  to  them  simply  as  SEQUEL, 

SEQUEL  is  designed  to  be  used  both  as  a stand-alone  language  for 
interactive  users  and  as  a data  sublanguage  embedded  in  a host  programming 
language  such  as  PL/I.  In  the  latter  case,  the  SEQUEL  statements  in  a program 
are  identified  by  a precompiler  which  replaces  them  with  valid  PL/I  calls  to 
a run-time  module  which  performs  the  desired  function.  SEQUEL,  as  its  name 
suggests  (Structured  English  Query  Language),  provides  extensive  query 
facilities  based  on  English  keywords.  In  addition,  a data  manipulation 
facility  permits  insertion,  deletion  and  update  of  individual  tuples  or  sets 
of  tuples  in  a relational  database.  A data  definition  facility  permits 
definition  of  relations  and  of  various  alternative  views  of  relations.  A 
data  control  facility  permits  each  user  to  authorize  other  users  to  access 
his  data;  it  also  provides  for  assertions  about  data  integrity,  and  for  stored 
transactions  that  may  be  triggered  by  various  events.  The  language  operates 
on  relations  in  first  (or  higher)  normal  form.  We  shall  briefly  illustrate 
the  use  of  SEQUET  by  writing  some  transactions  in  this  language. 

A sample  database,  extracted  from  [14],  is  depicted  in  Figure  3.1.  It 
consists  of  four  normalized  relations  (not  necessarily  in  2NF  or  3NF) . The 
EMP  relation  describes  a set  of  employees,  giving  the  employee  nianber,  name, 
department  number,  job  title,  manager's  employee  number,  salary  and  commission 


Relation 

Attributes 

EMP 

EMPNO , NAME , DNO , JOB , MGR , S AL , COMM 

DEPT 

DNO,DNAME,LOC 

USAGE 

DNO, PART 

SUPPLY 

SUPPLIER, PART 

Figure  3,1 


A Sample  Database 


-13- 


for  each  employee.  The  DEPT  relation  gives  the  department  number,  name  and 
location  of  each  department.  The  USAGE  relation  describes  the  parts  which 
are  used  by  the  various  departments.  The  supply  relation  describes  the  supplier 
companies  from  which  the  various  parts  may  be  obtained. 

The  most  basic  operations  of  the  SEQUEL  language  Involve  the  query 
facilities.  For  example,  to  find  the  names  of  employees  In  Dept.  50,  one 
may  write 

SELECT  NAME 
FROM  EMP 
WHERE  DNO-SO 

The  SELECT  clause  lists  the  attributes  to  be  returned.  If  the  entire  tuple 
Is  desired,  one  may  write  SELECT  *.  The  WHERE  clause  may  contain  any  collection 
of  predicates  which  compare  values  of  attributes  of  a tuple  to  constant  values 
(e.g.,  DNO-50)  or  conq>are  values  of  two  attributes  of  a tuple  with  each  other 
(e.g.,  SAL<C0MM).  The  predicates  may  be  connected  by  AND  and  OR,  and  paren- 
theses may  be  used  to  establish  precedence. 

Data  manipulation  facilities  are  those  facilities  whereby  a user  may 
directly  change  values  In  the  database.  These  facilities  fall  Into  the 
categories  of  Insertion,  deletion,  update  and  assignment.  The  Insertion 
facility  allows  the  user  to  Insert  a new  tuple  or  a set  of  tuples  Into  a 
relation.  Deletion  Is  a process  of  specifying  tuples  to  be  removed  from  the 
database.  The  tuples  are  specified  by  means  of  a WHERE  clause  which  Is 
syntactically  Identical  to  the  WHERE  clause  of  a query.  The  update  features 
of  SEQUEL  are  similar  to  those  for  deletion,  except  that  additional  specifications 
must  be  given  for  the  updates  to  be  made  on  the  selected  tuples.  New  values 
for  updated  attributes  may  be  stated  as  constants,  as  nested  queries  or  as 
expressions  based  on  the  original  values  of  the  attributes.  An  assignment 
statement  allows  the  result  of  a query  to  be  copied  Into  a newly-created 
relation  In  the  database.  The  new  relation  may  then  be  queried,  updated  or 
processed  In  the  same  way  as  any  other  relation.  An  example  of  data  manipulation 
that  Involves  Insertion  of  a single  tuple  Is  Illustrated  below.  To  insert  a 
new  tuple  named  'JONES'  with  employee  number  535  In  department  number  51, 
having  no  other  attributes,  a transaction  may  be  written  as 

INSERT  INTO  EMP (EMPNO, NAME, DNO ) ; 

<535,  'JONES',  51> 

The  data  definition  facilities  of  SEQUEL  enable  users  to  create  and 
drop  relations,  define  alternative  views  of  relations,  and  specify  the  access 
elds  (indexes,  etc.)  to  be  maintained  on  the  database.  For  example,  to  create 
the  DEPT  table  (l.e. , relation)  during  the  process  of  constructing  the  database, 
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one  may  write  | 

CREATE  TABLE  DEPT 

(DNO  (CHAR(2),  NONULL), 

DNAME  (CHAR (12)  VAR),  1 

LOC  (CHAR(20)  VAR))  ’ ' ! 

In  this  example.  It  Is  Indicated  to  the  system  to  create  a relation  (which 
Is  to  be  physically  stored)  with  three  attributes  DNO,  DNAME  and  LOC.  The  * | 

DNO  attribute  of  any  tuple  of  DEPT  Is  not  allowed  to  take  a null  value. 

SEQUEL  data  control  facilities  enable  users  to  control  access  to  their 
data  by  other  users,  and  to  exercise  control  over  the  Integrity  of  data  values. 

The  owner  of  the  EMP  relation,  for  example,  may  use  the  following  SEQUEL 
statement  to  grant  Smith  and  Anderson  the  right  to  read  or  update  the  JOB 
and  DNO  columns: 

GRANT  READ,  UPDATE  (JOB,  DNO)  ON  EMP 
TO  SMITH,  ANDERSON 

3.3  Access  Aids 

System  R consists  of  two  major  coi]q>onents  [5].  The  Relational  Storage 
System  (RSS)  Is  the  storage  subsystem  that  manages  devices,  space  allocation, 
storage  buffers,  transaction  consistency  and  locking,  deadlock  detection, 
backout  and  recovery.  It  also  maintains  Indexes  on  selected  fields  of  stored 
relations,  and  pointer  chains  across  relations.  The  Relational  Data  System 
(RDS)  provides  authorization.  Integrity  enforcement  and  support  for  alternative 
views  of  data.  It  also  supports  the  SEQUEL  language  and  maintains  the  catalogs 
of  external  names,  since  the  RSS  uses  only  system-generated  Internal  names. 

The  RDS  contains  an  optimizer  for  choosing  an  appropriate  path  for  any  given 
request  among  the  paths  supported  by  the  RSS. 

System  R relies  on  the  user  to  specify  the  access  paths  to  be  maintained 
on  the  stored  relations.  Access  paths  Include  Images  and  links.  An  Image 
In  the  RSS  Is  a logical  reordering  of  a relation  with  respect  to  one  or  more 
sort  fields.  It  provides  associative  access  capability.  The  RDS  can  rapidly 
fetch  a tuple  from  an  Image  by  keying  on  the  sort  field  values.  The  RSS 
maintains  each  Image  through  the  use  of  a multi-page  Index  structure.  At  most 
one  Image  on  a relation  may  have  the  clustering  property,  which  means  that 
tuples  which  are  near  each  other  In  the  ordering  of  that  Image  are  stored 
physically  near  each  other  In  the  database. 

Links  are  access  paths  in  the  RSS  which  link  tuples  of  one  relation  to 


r*lat«d  tuples  of  another  relation  through  pointer  chains.  Links  are  always 
eaployed  In  a value-dependent  manner:  the  user  may  specify  that  each  tuple 
of  a relation  may  be  linked  to  all  those  tuples  of  another  relation  that  have 
matching  values  In  some  field (s)  and  that  these  tuples  should  be  ordered  In 
some  value-dependent  way.  Like  an  Image,  a link  may  be  declared  to  have  a 
clustering  property.  In  which  case,  the  tuples  will  be  kept  close  to  the 
neighboring  tuples  In  that  link.  As  an  exaiiq>le  of  Images  and  links,  consider 
the  following.  In  order  to  create  a clustering  Image,  named  IM,  on  the  SAL 
attribute  of  the  EMP  table,  we  may  write 

CREATE  CLUSTERING  IMAGE  IM  ON  EMP (SAL). 

Similarly,  to  create  a non-clustering  link,  called  LK,  which  connects  tuples 
of  DEPT  to  tuples  of  EMP  that  match  on  the  DNO  attribute,  ve  may  write 

CREATE  LINK  LK 
FROM  DEPT (DNO) 

TO  EMP (DNO). 

If  we  are  also  to  order  the  employees  on  the  link  by,* say  JOB  and  SAL,  then 
we  simply  expand  this  SEQUEL  statement  with 
ORDER  BY  JOB, SAL. 

It  must  be  noted  that  the  access  paths  (Images  or  links)  contain  no 
logical  Information  that  cannot  be  derived  from  the  data  values  themselves.  | 

The  user  has  no  explicit  control  over  the  placement  of  tuples  In  Images  and 
links  (except  for  the  ability  to  declare  the  structure  of  an  image  or  link) . 

Neither  can  the  user  use  the  Image  or  link  directly  for  accessing  data.  Links 
and  images  are  used  only  by  the  optimizer  to  choose  optimal  access  paths. 

In  the  next  section  we  shall  demonstrate  how  a relational  database  may 
be  transformed  to  an  equivalent  database  that  can  be  supported  directly  by 
the  DBC. 


4.  DBC  REPRESENTATION  OF  A RELATIONAL  DATABASE 


Given  a conventional  general-purpose  computer  with  location-addressable 
storage.  It  Is  convenient  to  represent  a relational  database  In  the  following 
manner.  A template  Is  maintained  for  each  relation  Indicating  the  name  of 
each  attribute  and  Its  relative  position  within  the  relation  (besides  other 
Information  such  as  the  type  of  values  assiimed  by  an  attribute) . The  relative 
position  Is  a necessary  part  of  attribute  Information  since  each  stored 
tuple  carries  only  the  values  and  not  the  attribute  names  themselves.  All 
the  elements  of  a single  tuple  are  stored  In  a physical  sequence,  l.e., 
contiguously.  In  order  that  tuples  may  be  retrieved  without  having  to  scan 
the  entire  database,  two  kinds  of  auxiliary  Information  are  maintained.  A 
separate  Index  may  be  maintained  for  any  attribute  of  a relation  In  the  form 
of  an  itt-ary  tree.  Any  leaf  of  such  a tree  provides  the  address  to  a stored 
tuple  having  a particular  value  for  the  given  attribute.  Another  type  of 
auxiliary  Information  Is  maintained  directly  within  the  stored  tuples.  The 
tuples  that  are  related  by  the  fact  that  they  have  the  same  value  for  a 
given  attribute  may  be  linked  by  means  of  pointers  (addresses) . These 
auxiliary  Information  are  managed  by  the  system  according  to  the  specifications 
of  the  creator  of  a relation.  The  user  of  the  database  need  not  have  any 
knowledge  about  their  existence. 

In  the  DBC,  all  Information  Is  stored  In  the  form  of  records  that 
consist  of  attribute-value  pairs.  Since  the  secondary  storage  Is  content- 
addressable,  It  Is  not  necessary  to  have  separate  Indexes  or  address -dependent 
pointers  within  the  records.  Since  the  on-line  mass  memory  (MM)  Is  not  a 
monolithic  associative  memory.  It  Is  desirable  that  only  one  MAU  (minimal 
access  unit)  be  accessed  during  the  execution  of  a DBC  command.  In  order 
to  avoid  an  exhaustive  search  of  all  the  MAUs,  the  structure  memory  (SM) 
of  the  DBC  Is  utilized.  With  a proper  choice  of  keywords  to  be  entered  In 
the  structure  emmory.  It  Is  possible  to  restrict  the  search  required  for 
most  queries  to  a single  MAU.  Address-dependent  pointers  are.  In  fact,  not 
needed.  However,  a small  directory  In  the  SM,  automatically  managed  by  the 
DBC,  Is  still  essential  to  the  achievement  of  high  performance. 

4.1  Creation  of  DBC  Records 


A relational  database  Is  represented  In  the  DBC  by  creating  a record 
for  each  tuple.  Since  each  tuple  belongs  to  a certain  relation,  we  represent 
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this  fact  In  the  corresponding  DBC  record  by  means  of  the  attribute-value 
pair 

<RELATION,  relatlon-name>y 

where  RELATION  Is  a built-in  attribute.  With  the  Incorporation  of  such  a 
tuple  into  every  record.  It  will  be  possible  to  retrieve  any  record  based  on 
the  fact  that  It  belongs  to  a particular  relation.  In  response  to  any  query 
Involving  a certain  relation,  the  DBC  will  be  able  to  conduct  a content-search 
and  retrieve  all  those  records  that  belong  to  that  relation  and  that  satisfy 
the  other  conditions  required  by  the  query. 

A relation  (or  table)  may  be  defined  In  SEQUEL  by  means  of  a CREATE 
statement.  For  example,  the  EMP  relation  of  Figure  3 .1  may  be  defined  as 
follows : 

CREATE  TABLE  EMP 

(EMPNO  (INTEGER  , NONULL), 

NAME  (CHAR  (12)  VAR), 

DNO  (CHAR(2)), 

JOB  (CHAR(6)), 

MGR  (INTEGER), 

SAL  (DECIMAL (8, 2)), 

COMM  (DECIMAL(8,2))). 

In  a DBC  environment  (l.e.,  where  a front-end  computer  accesses  a database 
via  a DBC),  the  definition  of  the  table  is  retained  by  the  front-end  computer. 
Whenever  a tuple  Is  to  be  stored  In  the  database,  the  software  Interface  In 
the  front-end  computer  creates  a DBC  record  which  consists  only  of  attribute- 
value  pairs.  Such  a pair  is  created  for  the  relation  name  and  one  attribute- 
value  pair  for  every  column  of  the  relation,  as  shown  below: 

<column-name,  value>. 

Thus,  any  tuple  of  the  EMP  relation  Is  represented  in  the  DBC  by  means  of 
the  following  attribute-value  pairs: 

<RELATION,  EMP> 

<EMPNO,  eiiq)loyee-number> 

<NAME,  eiBployee-name> 

<DNO,  department-code> 

<JOB,  job-code> 

<MGR,  manager-number> 

<SAL,  salary> 

<COMM,  commlsslon>> 

If  any  one  of  the  columns  does  not  have  a corresponding  value  In  any  particular 
tuple,  then  It  Is  not  necessary  to  create  (or  store)  an  attribute-value  pair 
for  that  column.  Thus,  every  DBC  record  representing  an  EMP  tiiple  will  have 
an  attribute-value  pair  for  EMPNO  (since  this  column  always  takes  a non-null 
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value)  , but  It  may  not  have  such  a pair  for  DNO  (If  the  value  for  DNO 
happens  to  be  null) . 

A significant  feature  of  the  data  definition  of  the  relational  model  Is 
Its  flexibility.  Occasionally,  It  becomes  necessary  to  expand  an  existing 
table  by  adding  a new  column  to  It,  e.g.,  to  accommodate  a new  application. 
SEQUEL  allows  columns  to  be  added  to  the  right  side  of  an  existing  table  by 
means  of  an  EXPAND  statement,  which  specifies  the  name  and  data  type  of  the 
new  column.  Existing  tuples  are  considered  to  have  null  values  In  the  new 
column  until  they  are  updated.  This  feature  Is  easily  handled  In  the  DBG 
representation,  since  only  records  representing  the  new  tuples  will  have  an 
attribute-value  pair  corresponding  to  the  new  column.  Old  records,  until 
they  are  updated  by  a transaction,  will  remain  unaltered  Insplte  of  the 
expansion  of  the  table.  For  example,  a change  In  the  definition  of  the  DEPT 
table  may  be  made  by  adding  a column  (NEMPS)  to  keep  track  of  the  number  of 
employees  In  each  department.  This  Is  done  by  the  SEQUEL  statement 

EXPAND  TABLE  DEPT 

ADD  FIELD  NEMPS (INTEGER) . 

In  response  to  this  statement,  only  those  DBG  records  representing  new  tuples 
of  DEPT  will  have  the  extra  attribute-value  pair 
<NEMPS,  number-of-employee8>. 

Notice  that  It  Is  possible  that  a relation  may  have  a column  called 
RELATION.  In  that  case.  It  seems  that  there  will  be  two  attribute-value 
pairs  with  the  same  attribute,  namely, 

<RELATION,  relatlon-name>  and 
<RELATION,  value-of-the-column>. 

This  ambiguity  will  never  actually  occur.  Since  all  the  attributes  are 
coded  and  because  the  built-in  attribute  RELATION  has  been  given  a unique  code, 
it  is  different  from  the  codes  of  all  other  attributes. 

4.2  Access  Aids  and  Glusterlng 

System  R makes  use  of  Images  and  links  to  determine  optimal  access  paths. 
The  DBG,  however,  has  no  use  for  Images  and  links  that  are  not  designated  for 
clustering  purposes.  It  does  not  need  to  Implement  these  access  aids  In  the 
way  System  R does.  This  Is  due  to  the  fact  that  the  DBG  uses  content- 
addressable  memory,  thus  eliminating  the  need  of  pointers.  The  only 
Information  on  Images  and  links  that  the  DBG  will  make  use  of  Is  the 
clustering  Information.  It  should  be  obvious  that  the  intention  In  specifying 
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clusterlng  Images  and  clustering  links  Is  to  physically  gather  together  all 
data  that  will  be  most  frequently  retrieved  In  response  to  a single  query. 

Since  the  DBC  does  not  simultaneously  access  two  or  more  MAUs  It  will  be 
desirable  to  cluster  such  data  In  a single  MAU. 

Clustering  of  the  DBC  records,  however,  always  starts  with  a relation 
name.  We  first  attempt  to  store  In  as  few  MAUs  as  possible  all  those  DBC 
records  that  belong  to  the  same  relation.  The  reason  for  clustering  by 
relation  name  Is  simply  that  all  SEQUEL  queries  Involve  one  or  more  relations. 
Thus,  the  relation  names  being  known.  It  will  always  be  possible  to  form  DBC 
commands  with  the  query  field  consisting  of  at  least  one  predicate  of  the  form 
(RELATION-relatlon-name) . 

In  this  manner  we  will  almost  always  be  ensured  that  the  DBC  will  satisfy  any 
given  query  by  accessing  at  most  a number  of  MAUs  that  is  no  greater  than 
the  number  of  MAUs  required  to  store  all  the  records  in  a given  relation.  The 
actual  number  of  MAUs  accessed  will.  In  fact,  be  usually  less  than  this  number. 
We  observe,  therefore,  that  DBC  records  are  clustered  primarily  by  relation 
name.  This  clustering  Is  done  while  Inserting  each  record,  say,  belonging 
to  relation  r,  by  Indicating  In  the  Insert  command  that  (R£LAT10N°r)  Is  the 
primary  clustering  condition  (which  Is  called  mandatory  clustering  condition 
In  DBC  terminology).  Further  clustering  Is  done  based  on  the  Information 
on  the  access  aids  (Images  and  links),  as  we  shall  now  discuss. 

For  records  to  be  Inserted  In  the  database,  there  Is  a 
secondary  clustering  condition  (called  optional  clustering  condition  In  DBC 
terminology)  derived  from  the  specification  of  the  access  aids.  There  are 
no  secondary  clustering  conditions  for  records  of  any  relation  for  which  no 
clustering  Image  or  clustering  link  has  been  specified  In  the  definition 

I 

of  the  database.  We  assume.  In  keeping  with  the  language  definition  of 
SEQUEL,  that  there  Is  no  more  than  one  clustering  Image  or  clustering  link 
defined  on  any  relation.  If  there  are,  then  we  may  arbitrarily  pick  one  of 
them  (In  determining  a secondary  clustering  condition)  since  It  normally 
does  not  pay  to  have  more  than  two  levels  of  clustering  In  a PCAM  (partitioned 
i content-addressable  memory)  with  large  partitions.  Assuming  that  we  have 

decided  on  a particular  Image  or  link  for  clustering  purposes,  let  us 
illustrate  what  clustering  conditions  are  to  be  provided  by  the  front-end 
computer  to  the  DBC. 
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A.  Use  of  Clustering  Links 


Considering  the  general  definition  of  a clustering  link,  namely, 

CRBATE  CLUSTERING  LINK  link-name 
FROM  relatlon-l(attrlbute-llst-l) 

TO  relatlon-2 (attrlbute-llst-2) 

ORDER  BY  attrlbute-llst-3 

we  are  to  link  every  record  of  relatlon-1  to  one  or  more  records  of  relatlon-2 
such  that  attrlbute-llst-1  and  attrlbute-llst-2  have  matching  values.  Since 
records  of  two  different  relations  cannot  be  clustered  (because  the  primary 
clustering  condition  Is  based  on  a single  relation  name) ,we  only  cluster  all  those 
records  of  relatlon-2  that  belong  to  the  same  link.  Thus,  usually  a maximum 
of  two  accesses  will  be  necessary  to  access  all  records  that  logically  belong 
to  the  same  link;  one  access  for  the  records  belonging  to  relatlon-1 
and  another  access  for  the  records  belonging  to  relatlon-2.  We  cluster 
relatlon-2,  therefore,  by  attrlbute-llst-2;  and  this  Is  done  as  follows: 

Assume  that  the  tuples  of  relatlon-2  together  occupy  upto  N MAUs  (In  case 
no  knowledge  Is  available  as  to  the  tuple  size  and  number  of  tuples  In  a 
relation,  assume  N to  be  a large  number,  say  200,  which  Is  likely  to  be  large 
enough  to  accommodate  any  relation).  In  any  tuple  (of  relatlon-2)  to  be 
stored  In  the  database,  add  an  extra  attribute-value  pair 
<CLUSTER,  hash  number>, 

where  hash  number,  between  1 and  N,  Is  obtained  by  hashing  the  values  of 
attrlbute-llst-2.  The  secondarj  clustering  condition  for  this  tuple,  then. 

Is  (CLUSTER  = hash  number).  The  primary  clustering  condition,  we  may  recall. 

Is  simply  (RELATION  = reIatlon-2). 

B.  Use  of  Clustering  Images 

The  Implementation  of  the  clustering  properties  of  an  Image  Is  only  a 
little  more  complex.  The  added  complexity  arises  due  to  the  fact  that  an 
Image  Is  always  ordered  by  one  or  more  attributes.  Let  us  Illustrate  the 
liq)lementatlon  process  by  considering  the  general  definition  of  a clustering 
Image: 

CREATE  CLUSTERING  IMAGE  Image-name 
ON  relatlon-1  (attribute-list) 

Assume  that  the  attribute-list  consists  of  the  attributes  A1,A2,. . . ,An. 

Then  the  Image  Is  to  be  ordered  logically  (as  well  as  physically,  since  It  Is 
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a clustering  Image)  first  by  Al,  secondly  by  A2,  etc. 

If  the  possible  values  of  attribute  A1  have  a reasonably  wide  range, 

then  It  Is  clear  that  clustering  by  A2 An  (after  clustering  by  Al)  will 

give  no  added  advantage.  For  example.  If  the  values  of  Al  were  to  range 

uniformly  between  1 and  1000  and  the  number  of  MAUs  occupied  by  relatlon-1 
Is  20,  then  we  may  be  able  to  cluster  In  such  a way  that  records  with  Al- 
values  ranging  between  1 and  50  are  stored  In  the  first  of  the  20  MAUs,  records 
with  Al-values  ranging  between  51  and  100  are  stored  In  the  second,  etc.  Further 

clustering  by  attribute  A2  serves  no  purpose  because  In  any  search  query  with 

equality  predicates  Involving  attributes  Al  and  A2,  the  value  of  Al  alone 
(together  with  relation  name)  Is  sufficient  to  determine  the  MAU  that  contains 
all  the  records  satisfying  the  query. 

On  the  other  hand,  if  the  possible  values  of  attribute  Al  have  a very 
small  range  (e.g. , the  attribute  SEX  may  take  values  either  male  or  female)* 
then  It  Is  advantageous  to  use  attribute  A2  for  creating  finer  (smaller) 
clusters  that  may  be  accommodated  In  a single  MAU.  Similarly,  If  each  of 
Al  and  A2  has  a small  range  of  permissible  values,  then  attribute  A3  may 
be  used  to  create  finer  clusters. For  clustering  purposes,  therefore,  we 

use  attribute  Al Al  such  that  each  of  Al, . . . ,A(1-1)  assume  a small  range 

of  values  and  either  l'»n  or  Al  assumes  a large  range  of  values. 

To  determine  the  cluster  number  of  any  tuple,  the  front-end  computer 
makes  use  of  a table  for  each  relation.  There  are  as  many  colximns  In  a 
table  as  there  are  attributes  chosen  for  clustering  (as  discussed  In  the 
last  paragraph) . The  entries  in  each  column  have  the  following  meaning: 

Entry  1:  Name  of  the  clustering  attribute. 

Entry  2:  Range  of  values  of  the  attribute; 

It  Is  either  large  or  small  . 

Entry  3:  Integer  representing  the  following: 

If  Entry  2 •>  small, 

then  Entry  3 * number  of  values  in  the  range. 

If  Entry  2 - large, 

then  Entry  3 ■ number  of  partitions 
made  of  the  range. 

Entries  4,5,...:  Values  of  the  attribute's  range, 

or  maximum  values  for  the  partitions  of  the  range. 

As  an  example,  consider  a POPULATION  relation,  some  of  whose  attributes  are 

SEX:  male  or  female 

STATUS:  employed,  ineligible  or  i<nemployed 


I 


-22- 


AGE:  any  integer. 

If  this  relation  Is  to  be  clustered  by  SEX,  STATUS  and  AGE  respectively,  then 
we  may  create  a table  as  shown  In  Figure  4.1.  There  are  2*3*5  « 30  clusters 
for  this  relation  as  shown  In  Figure  4.2.  The  number  of  partitions  chosen 
for  the  attribute  AGE  Is  5.  This  choice  should  not  be  arbitrary  but  should 
be  based  on  the  fact  that  the  total  number  of  clusters  required  Is  equal 
to  or  somewhat  larger  than  (say,  double  or  triple)  the  number  of  MAUs 
occupied  by  the  relation.  In  case  the  size  of  a relation  (In  terms  of  MAUs 
required)  Is  unknown,  then  a suitably  large  number  Is  assumed,  say  200. 

The  creator  of  a clustering  Image  may  provide  the  range  Information 
on  the  various  attributes  by  statements  such  as 
RANGE  OF  SEX  IS  (MALE,  FEMALE) 

RANGE  OF  AGE  IS  INTEGER  (SMALLEST-0 ,LARGEST-80) 

RANGE  OF  NAME  IS  ALPHA  (SMALLEST-'  LARGEST- ' ZZZ ' ) 

where.  In  the  case  of  Integer,  floating-point  or  alphanumeric  attributes, 
the  usual  range  Is  also  specified*  Any  range  declared  to  be  Integer, 
alphanumeric  or  floating-point  Is  considered  a large  range.  Any  range  that 
Is  actually  listed  out  (e.g.,  male,  female)  is  considered  small.  This  RANGE 
specification  Is  currently  not  a part  of  the  SEQUEL  CREATE  statement  but  can 
easily  be  Incorporated  In  the  language  definition  In  order  to  facilitate 
clustering  on  a partitioned  content-addressable  memory. 

Once  a cluster  number  C Is  determined  for  any  record  of  relation  R, 
then  a special  ke3word  <CLUSTER,  C>  Is  Included  as  part  of  the  stored  record. 
The  primary  and  secondary  clustering  conditions  for  this  tuple,  then,  are 
(RELATION  - R)  and  (CLUSTER  - C) , respectively. 

In  this  section,  we  have  considered  the  DBC  representation  of  a relational 
database.  Clustering  of  the  DBC  records  has  been  considered  as  part  of  the 
over-all  clustering  problem;  It  has  been  shown  how  an  extra  keyword  with 
attribute  CLUSTER  Is  created  and  stored  In  each  record.  In  the  next  section 
we  shall  Illustrate  how  SEQUEL  queries  are  transformed  Into  DBC  commands  . 
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Attribute 

SEX 

STATUS 

AGE 

Range 

Small 

Small 

Large 

2 

a 

3 

5 

Values  or 

vdliiA 

female 

employed 

20 

male 

Ineligible 

40 

unemployed 

60 

80 

Flgure  4.1.  Table  created  for  a clustering  Image 


Cluster  Number 

SEX 

STATUS 

AGE 

1 

female 

employed 

<20 

2 

• 

female 

• 

eiiq>loyed 

<40 

• 

« 

5 

• 

• 

female 

• 

employed 

• 

• 

<lnflnlty 

6 

« 

female 

• 

Ineligible 

• 

<20 

« 

• 

10 

female 

• 

Ineligible 

• 

<lnflnlty 

11 

• 

• 

female 

• 

unemployed 

• 

<20 

• 

• 

15 

• 

female 

• 

unemployed 

<lnflnlty 

16 

• 

male 

• 

employed 

• 

<20 

• 

30 

• 

male 

unemployed 

• 

<lnflnlty 

Figure  4.2.  Cluster  numbers  corresponding  to  various  values  of 
SEX,  STATUS  and  AGE 
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5.  THE  TRANSLATION  PROCESS 

The  DBC,a8  we  have  indicated  earlier,  acts  as  a bach-end  machine 
executing  commands  given  by  a front-end  computer.  Let  us  call  the  software 
package .which  resides  in  the  front-end  computer  and  which  creates  and  handles  a 
relational  database  stored  in  the  DBG,  the  RDBI  (Relational  Database 
Interface).  The  RDBI  Intercepts  the  data  sublanguage  (in  this  case,  SEQUEL) 
statements,  which  are  part  of  a host  language  program,  and  translates  them 
into  a series  of  DBC  commands. 


5.1  Translation  of  SEQUEL  Queries 


We  shall  Illustrate  the  process  of  translation  from  SEQUEL  statements 
to  DBC  commands  by  means  of  a series  of  examples.  Our  sample  database  is 
that  of  Figure  3.1.  For  convenience,  we  will  reproduce  here  the  four 
relations  in  the  database: 

Relation  Attributes 

EMP  EMPNO, NAME, DNO,JCB,MGR. SAL, COMM 

DEPT  DNO,DNAME,LOC 

USAGE  DNO,PART 

SUPPLY  SUPPLIER, PART 

In  each  example,  first  the  SEQUEL  statement  and  then  the  corresponding 
DBC  commands  are  shown. 

F.TampIp  1 ; The  following  SEQUEL  statement  and  its  equivalent  DBC  command  will 

find  the  names  of  employees  in  Dept.  50. 

SEQUEL: 

SELECT  NAME 
FROM  EMP 
WHERE  DNO-50 

DBC  Command: 

RETRIEVE:  (NAME)  ((RELATION-’ EMP' )&(DNO-50)) 

K-rampTe  2»  A list  of  all  the  different  department  numbers  in  the  EMP  table 
is  created  by  the  following  statement  or  commands. 

SEQUEL: 

SELECT  UNIQUE  DNO 
FROM  EMP 

DBC  Command: 

RETRIEVE:  (UNIQUE  DNO) (RELATION-' EMP' ) 


Exa^leS:  To  list  the  names  of  eii5>loyees  in  departments  25,  47  and  53. 
the  following  statement  may  be  used. 
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SEQUEL: 

SELECT  NAME 
FROM  EMP 

WHERE  DNO  IN  (25,47,53) 

DBC  Command: 

RETRIEVE : (NAME) ( ( (RELATION- ' EMP  * ) & (DNO-25 ) ) 

V ( (RELATION- ' EMP ' ) & (DNO-4  7 ) ) 

V((RELATION-»EMP' ) & (DNO-53) ) ) 

Example  4:  Consider  listing  the  names  of  employees  who  work  for  departments 

In  Evanston.  This  type  of  transaction  requires  access  to  two  different 

relations  and  Is,  therefore,  expressed  In  SEQUEL  by  means  of  a nested 

SELECT  statement.  The  Inner  part  of  the  nesting  returns  the  collection  of 

DNO  values  of  the  departments  located  In  Evanston.  The  outer  part  then 

proceeds  as  though  It  were  given  a set  of  constants  In  lieu  of  the  Inner 

SELECT  clause. 

SEQUEL: 

SELECT  NAME 
FROM  EMP 
WHERE  DNO  IN 

SELECT  DNO 

FROM  DEPT 

WHERE  LOO 'EVANSTON* 

DBC  Commands: 

a.  RETRIEVE : (DNO) ( (RELATION- 'DEPT* ) & (LOG- 'EVANSTON ’ ) ) 

For  each  department  number  'dl*  retrieved  by  (a),  the  KDBI  Issues  the 
DBC  command: 

b . RETRIEVE : (NAME) ( (RELATION- ' EMP ' ) & (DNO- ' dl ' ) ) 


Exanple  5:  The  employee  ntindjer,  name,  and  salary  of  employees  In  Dept.  50 
may  be  listed  as  follows.  In  the  order  of  employee  number. 

SEQUEL: 

SELECT  EMPNO, NAME, SAL 
FROM  EMP 
WHERE  DNO50 
ORDER  BY  EMPNO 

DBC  Command: 

RETRIEVE:  (EMPNO, NAME, SAL) 

( (RELATION-  ’ EMP » ) & (DNO50)  ) 

SORT  BY  EMPNO 

In  case  the  ordering  of  the  response  set  Is  to  be  done  also  by  some 
secondary  attributes,  then  such  a sorting  Is  done  by  the  RDBI  In  the 
front-end  computer.  The  DBC  only  sorts  by  a single  attribute. 


Example  6;  An  Important  class  of  queries  Is  exemplified  In  the  determination 
of  average  salary  of  clerks.  The  built-in  SEQUEL  function  AVG  can  be  used 
to  accomplish  this  result.  Other  built-in  functions  In  the  SEQUEL  language 
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are  SUM,  COUNT,  MAX  and  MIN.  These  functions  are  Indeed  part  of  the  hardware 
OBC  features . 

SEQUEL: 

SELECT  AVG(SAL) 

FROM  EMP 
WHERE  JOB- 'CLERK' 

DBC  Command: 

RETRIEVE: (AVG(SAL) ) ( (RELATION- ' EMP ' ) & (JOB- ' CLERK ' ) ) 


Example  7:  The  following  statement  determines  the  count  of  all  the  different 
jobs  held  by  employees  In  Dept.  50. 

SEQUEL: 

SELECT  COUirr  (UNIQUE  JOB) 

FROM  EMP 
WHERE  DN(^50 

DBC  Command: 

RETRIEVE:  [COUNT  ONLY] (UNIQUE  JOB) 

( (RELATION- ' EMP ' ) & (DNO-50) ) 

Example  8:  In  addition  to  simple  attributes  and  built- In  functions,  SEQUEL 
allows  a user  to  construct  arithmetic  expressions  In  the  SELECT  clause.  All 
the  following  are  valid  SEQUEL  expressions: 

AVG(SAL)/52 

AVG (SAL)+AVG (COMM) 

MIN(SALfCOMM) 

Since  the  DBC  does  not  have  any  arithmetic  capabilities,  all  arithmetic 
operations  are  done  by  the  RDBI  In  the  front-end  computer.  For  exan^>le, 
to  execute  the  SEQUEL  statement  for  the  first  expression,  the  RDBI  sends 
one  command  to  the  DBC  to  retrieve  AVG(SAL).  It  then  divides  the  resulting 
number  by  52.  For  the  second  expression,  two  DBC  comnands  are  required 
to  retrieve  two  numbers,  which  are  then  added.  For  the  third  expression, 
a single  DBC  command  Is  required  to  retrieve  the  SAL  and  COMM  fields  of  each 
EMP  record;  the  addition  and  MIN  operations  are  then  performed  by  the  RDBI . 

Example  9:  Consider  listing  all  the  departments  and  the  average  salary  of 
each.  This  Is  an  example  of  a query  In  which  a relation  needs  to  be 
partitioned  Into  groups.  A bullt-ln  function  can  then  be  applied  to  each 
group. 

SEQUEL: 

SELECT  DNO,AVG(SAL) 

FROM  EMP 
GROUP  BY  DNO 

DBC  Comiands: 

a.  RETRIEVE: (UNIQUE  DNO) (RELATION- ' EMP ' ) 

For  each  department  number  'dl'  retrieved  by  (a),  the  RDBI  Issues  a command; 


b . RETRIEVE ; (AVG (SAL) ) ( (RELATION- ' EMP ' ) & (DNO- ' dl ' ) ) 

Example  10;  Sometimes  It  may  be  desired  to  partition  a relation  Into  groups 

and  then  to  apply  a predicate  or  a set  of  predicates  which  chooses  only  some 

of  the  groups  and  disqualifies  others.  These  group-qualifying  predicates 

are  placed  In  a special  HAVING  clause.  A predicate  In  a HAVING  clause  may 

compare  an  aggregate  property  (e.g. , AVG(SAL))  of  a group  to  a constant  or 

to  another  aggregate  property  of  the  same  group.  The  following  SEQUEL 

statement  may  be  used  to  list  all  those  departments  In  which  the  average 

employee  salary  Is  less  than  10,000. 

SEQUEL: 

SELECT  DNO 
FROM  EMP 
GROUP  BY  DNO 
HAVING  AVG(SAL)<10000 

DBC  Commands : 

a.  RETRIEVE:  (UNIQUE  DNO) (RELATION- 'EMP') 

For  each  department  number  'dl'  retrieved  by  (a),  the  RDBI  Issues  a 
command: 

b . RETRIEVE ; (AVG (SAL) ) (RELATION- ' EMP ' ) & (DNO- ' dl ' ) ) 

Since  the  DBC  does  not  make  comparisons  on  aggregate  properties,  the  final 
selection  of  DNO  based  on  (AVG(SAL)<10000)  Is  done  by  software  (l.e.,  by 
the  RDBI)  In  the  front-end  computer. 

Example  11:  When  a query  has  both  a WHERE  clause  and  a HAVING  clause,  then 

the  WHERE  clause  has  precedence  since  It  Is  applied  to  qualifying  tuples. 

In  contrast  to  the  HAVING  clause  which  la  applied  to  groups  of  tuples.  Use 

Is  made  of  both  these  clauses  In  listing  the  departments  which  employ  more 

than  ten  clerks. 

SEQUEL: 

SELECT  DNO 
FROM  EhO^ 

WHERE  JOB- 'CLERK' 

GROUP  BY  DNO 
HAVING  COUNT (*)>10 

The  notation  COUNT (*)  denotes  the  count  of  tuples  In  a group. 

DBC  Commands: 

a.  RETRIEVE: (UNIQUE  DNO) (RELATION- 'EMP ' ) 

For  each  department  number  'dl'  retrieved  by  (a),  the  following  command 
Is  Issued: 

b.  RETRIEVE: [COUNT  ONLY] 

( (RELATION- ' EMP ')& (DNO- ' dl ')& (JOB- ' CLERK ') ) 

The  RDBI  now  returns  the  name  of  only  those  departments  for  which  a count 
of  greater  than  10  Is  retrieved  by  (b) . 


Example  12;  Set  comparison  operators  like  ■,  i<,  [IS]  [NOT]  IN,  CONTAINS  and 

DOES  NOT  CONTAIN  are  allowed  In  a HAVING  clause  as  Illustrated  by  this  example, 

which  lists  the  departments  which  have  employees  with  every  possible  job  title. 

SEQUEL: 

SELECT  DNO 
FROM  EMP 
GROUP  BY  DNO 
HAVING  SET (JOB)- 
SELECT  JOB 
FROM  EMP 

DBC  Commands: 

a.  RETRIEVE : (UNIQUE  DNO) (RELATION- ' EMP ' ) 

b.  RETRIEVE: (UNIQUE  JOB) (RELATION- 'EMP ' ) SORT  BY  JOB 

For  every  department  number  'dl'  retrieved  by  (a).  Issue  the  command: 

c.  RETRIEVE: (UNIQUE  JOB) 

((RELATION-’EMP')&(DNO-'dl'))  SORT  BY  JOB 

For  each  department,  the  comparison  of  each  of  the  sets  In  (c)  to  the  set 
In  (b)  Is  done  by  software  (l.e.,  by  the  RDBl). 

Example  13:  The  set  theoretic  operators  INTERSECT,  UNION  and  MINUS  are  also 

available  In  SEQUEL.  Consider,  for  example,  the  listing  of  all  the  departments 

which  have  no  employees. 

SEQUEL: 

SELECT  DNO 
FROM  DEPT 
MINUS 
SELECT  DNO 
FROM  EMP 

DBC  Commands: 

a.  RETRIEVE: (UNIQUE  DNO) (RELATION- 'DEPT' ) 

b.  RETRIEVE:  (UNIQUE  DN0HRELATI0N-*EMP' ) 

The  set  operation  MINUS  Is  now  done  by  the  RDBI. 


Example  14:  A Join  operation  may  be  required  to  return  values  selected  from 
more  than  one  relation.  The  names  of  all  employees  and  the  locations  where 
they  work  may  be  listed  by  the  query: 

SEQUEL: 

SELECT  EMP. NAME, DEPT. LOC 

FROM  EMP, DEPT 

WHERE  EMP.DNO-DEPT.DNO 

DBC  Command: 

RETRIEVE: (NAME, DNO) (RELATION- 'EMP ' ) 

CONNECT  ON  (DNO, DNO) 

(LOC,DNO) (RELATION- 'DEPT* ) 

Here,  there  are  two  field  specification  lists:  (NAME, DNO)  for  the  first 
query  and  (LOC,DIr  for  the  second  query.  The  connand  Is  to  connect  (Join) 
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on  the  two  DNO  attributes  and  return  as  response  data  triples  of  the 
form  (NAME, DNO, LOG ) , where  NAME  Is  taken  from  the  first  field  specifications 
list,  LOG  is  taken  from  the  second  list,  and  DNO  is  common  to  both.  The 
RDBI  now  returns  to  the  user  only  the  pairs  (NAME, LOG)  by  deleting  DNO 
from  the  triples  returned  by  the  DBG. 

Example  15;  In  some  circumstances,  it  is  necessary  to  join  a relation  with 
Itself  according  to  some  criterion.  The  relation  name  may  then  have  to  be 
listed  more  than  once  and  labelled,  e.g.,  X and  Y may  be  two  labels  for  a 
relation  EMP,  As  an  example,  the  following  SEQUEL  query  will  list  the 
enqjloyee's  name  and  his  manager’s  name  for  each  employee  whose  salary  exceeds 
his  manager's  salary. 

SEQUEL: 

SELEGT  X.NAME.Y.NAME 
FROM  EMP  X,  EMP  Y 
VfflERE  X.MGR=Y.EMPNO 
AND  X.SAL>Y,SAL 

DBG  Gommand : 

a.  RETRIEVE : (MGR) (RELATION=  * EMP ' ) 

GONNEGT  ON  (MGR,EMPNO) 

(EMPNO) (RELATION* ' EMP ' ) 

The  only  difference  between  this  command  and  the  command  for  Example  14 

is  that  only  one  attribute  is  returned,  instead  of  X.NAME  and  Y.NAME  as 

well.  This  is  because  the  AND  clause  has  still  got  to  be  considered.  i 

Notice  that  since  a manager  has  at  least  one  employee  (in  general) , a : 

modified  command  (a')  would  also  have  the  same  effect  as  (a),  yet  taking  j 

i 

less  time  to  execute.  However,  (a')  is  not  general  enough  for  all  situations.  j 

a' , RETRIEVE: (UNIQUE  MGR) (RELATION- 'EMP ' ) | 

For  each  manager  number  'mi*  returned  by  (a),  do  the  following:  Send  ] 

a command  j 

b.  RETRIEVE: (NAME,SAL) ((RELATION-* EMP’ )&(EMPNO=' ml')) 
and  for  each  (nj,sk)  pair  returned  by  (b),  send  a command 

c.  RETRIEVE; (NAME) ((RELATION-*EMP*)&(MGR-*mi')&(SAL>sk)) 

Notice  that  the  name  retrieved  by  (c)  is  an  employee  name,  and  that 
returned  by  (b)  is  the  corresponding  manager’s  name. 

Steps  (b)  and  (c)  have  been  written  in  such  a way  that  for  every 
manager,  the  DBG  accesses  all  his  employees  at  the  same  time.  These 
two  steps  could  otherwise  have  been  written  such  that  for  every  employee, 
the  DBG  accesses  all  his  managers  at  the  same  time.  But,  of  course,  every 
employee  has  a single  manager.  Therefore,  the  way  we  have  written  the 
commands  is  better  than  its  alternative,  since  fewei  number  of  accesses 
is  required  in  the  former  case.  The  decision  is  mad®  on  the  basis  of 
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the  fact  that  there  are  fewer  unique  values  of  MGR  than  there  are  of  EMPNO. 


Example  16:  SEQUEL  permits  a label  to  be  used  to  qualify  attribute  names 
outside  the  block  In  which  the  label  Is  defined.  The  following  query  uses 
this  feature  In  listing  the  suppliers  who  supply  all  the  parts  used  by 
Dept.  50. 

SEQUEL: 

SELECT  SUPPLIER 
FROM  SUPPLY  X 
WHERE 

(SELECT  PART 

FROM  SUPPLY 

WHERE  SUPPLIER=*Xo  SUPPLIER) 

CONTAINS 

(SELECT  PART 
FROM  USAGE 
WHERE  DNO-50) 

DBC  Command: 

a.  RETRIEVE: (UNIQUE  SUPPLIER) (RELATION- ' SUPPLY ' ) 

b . RETRIEVE : (PART) (RELATION- * USAGE ' ) & (DNO-50) ) 

Since  the  block  after  CONTAINS  has  a comparison  Involving  a constant, 

It  needs  to  be  executed  only  onceo  This  Is  done  by  command  (b)  given 
above.  For  each  supplier  'si*  retrieved  by  (a),  a DBC  command 

c.  RETRIEVE: (PART) ((RELATION- 'SUPPLY* )& (SUPPLIER- 'SI*)) 

la  sent,  and  the  sets  retrieved  by  (b)  and  (c)  are  compared  by  software. 

The  same  query  could  have  been  made  In  SEQUEL  by  means  of  GROUP  BY 

and  the  special  function  SET,  as  given  below: 

SELECT  SUPPLIER 
FROM  SUPPLY 
GROUP  BY  SUPPLIER 
HAVING  SET (PART)  CONTAINS 
SELECT  PART 
FROM  USAGE 
WHERE  DNO-50 

The  DBC  commands  would  be  the  same  as  before. 


Exa]iq>le  17:  As  a final  example,  consider  listing  the  names  of  employees  who 

have  the  same  job  and  salary  as  those  of  Smith.  This  Is  done  as  follows: 

SEQUEL: 

SELECT  NAME 
FROM  EMP 
WHERE  <JOB,SAL>  = 

SELECT  JOB, SAL 

FROM  EMP 

WHERE  NAME-* SMITH* 

DBC  Commands: 

a.  RETRIEVE : (JOB, SAL) ( (RELATION- * EMP  * ) & (NAME- 'SMITH ') ) 

For  the  tuple  (J,s)  retrieved  by  (a),  a second  DBC  command  Is  sent: 

b,  RETRIEVE: (NAME) ( (RELATION- 'EMP ' ) & (JOB- ' j * ) & (SAL-*s ' ) ) 


-31- 


5.2  Use  of  the  Clustering  Information 

Through  an  extensive  list  of  examples,  we  have  tried  to  Indicate  In 
the  last  section  some  of  the  principles  Involved  In  translating  SEQUEL 
queries  to  DBG  commands.  The  relation  name  associated  with  a query  Is  seen 
to  be  a part  of  all  DBG  commands,  thereby  making  It  possible  to  limit  the 
search  to  at  most  as  many  MAUs  as  are  required  to  store  the  entire  relation. 
This  Is  because  the  DBG  records  are  clustered  primarily  by  the  relation 
names.  Not  all  record-contiguity  Information,  however,  has  been  made  use 
of  In  formulating  the  DBG  commands. 

As  we  might  recall  from  Section  4,  the  definition  of  a relational 
database  consists  of,  besides  other  things,  a specification  of  certain  access 
aids  such  as  the  Images  and  links.  Furthermore,  some  of  these  access  aids 
are  declared  to  be  applicable  for  clustering  purposes.  There  can  be  no 
more  than  one  clustering  Image  or  clustering  link  for  any  relation.  We 
have  shown  In  Section  4,  how  a special  keyword  with  attribute  GLUSTER  Is 
created  for  each  DBG  record  that  represents  a tuple  of  a relation  having 
a clustering  Image  or  clustering  link.  This  keyword  Is  created  during  the 
time  of  Insertion  of  a new  record  Into  the  database. ' Thereafter,  whenever 
records  are  to  be  retrieved  In  response  to  a SEQUEL  query,  the  special 
keyword  may  be  regenerated  and  used  as  a predicate  In  the  corresponding  DBG 
query.  This  will  be  possible,  however,  only  when  the  SEQUEL  query  consists 
of  comparisons  Involving  an  attribute  that  Is  one  of  the  attributes  In  a 
clustering  Image  or  clustering  link. 

Given  a SEQUEL  query,  the  DBG  commands  are  created  In  two  steps: 

1.  The  DBG  commands  are  Initially  created  as  shown  In  all  our  earlier 
examples. 

2.  For  each  DBG  command  thus  generated,  whenever  possible,  the  special 
keyword  <GLUSTER, cluster  number>  Is  computed  and  Included  as  a 
predicate  In  the  DBG  command.  The  modified  command  Is  finally  sent 
to  the  DBG  for  actual  execution. 

Gomputatlon  of  the  special  keyword  Is  quite  straightforward.  If  there  is 
a clustering  link  on  attributes  Al,A2,...,An  and  If  these  same  attributes 
occur  In  a DBG  command  (created  In  the  first  step  given  above)  In  the  form 
of  eqxuillty  predicates,  then  the  values  (of  the  attributes)  are  hashed  to 
generate  a cluster  number.  The  hashing  algorithm  Is  the  same  as  the  one 
used  during  the  record-insertion  process.  As  an  example,  consider  that 
there  Is  a clustering  link  as  defined  below: 


CREATE  CLUSTERING  LINK  L5 
FROM  DEPT(DNO) 

TO  EMP (UNO) 

Then  the  EMP  relation  Is  clustered  secondarily  by  DNO.  Now  ar'sume  that  there 
Is  a SEQUEL  query 

SELECT  EMPNO, NAME, JOB 
FROM  EMP 
WHERE  DNO-50 

This  query  Is  translated  to  DBC  comnands  In  two  steps: 

Step  1.  The  Initial  DBC  command  generated  Is 

RETRIEVE: (EMPNO, NAME, JOB) 

( (RELATION- ' EMP ' ) & (DNO-50) ) 

St'ip  2.  The  department  number  5G  Is  hashed  to  compute  a cluster 

number  1.  The  modified  DBC  command  now  generated  Is: 

RETRIEVE: (EMPNO, NAME, JOB) 

( (RELATION- ' EMP ' ) & (DHC-SO) & (CLUSTER- ' 1 ' ) ) 

In  the  case  of  a clustering  Image,  the  clustering  table  (discussed  In 

Section  4)  used  during  the  process  of  record  Insertion  Is  again  used  for 

retrieval  purposes.  For  example,  consider  the  clustering  image  as  defined 

below: 

CREATE  CLUSTERING  IMAGE  13 
ON  EMP (DNO, SAL) 

Let  the  possible  department  numbers  be  D100,D200,D300  and  D400.  Let  us 
further  assume  that  the  SAL  attribute  (which  has  a very  large  range)  is 
partitioned  Into  five  subranges.  Then  the  clustering  table  may  look  like 
as  shown  in  Figure  5,1,  There  are  a total  of  20  clusters  for  the  EMP 
relation  as  shown  In  Figure  5.2,  Now  assume  that  there  Is  a SEQUEL  query: 

SELECT  EMPNO 
FROM  EMP 
WHERE  DNO-*D200' 

AND  COMIt>5000 

Then  the  DBC  command  Is  generated  In  the  following  steps: 

Step  1,  Generate  Initial  command 
RETRIEVE: (EMPNO) 

( (RELATION- ' EMP ' ) & (DNO- 'D200 ' ) & (COMM>5000) ) 

Step  2.  Since  DNO-’D200’  corresponds  to  the  five  cluster  numbers 
6, 7, 8, 9 and  10,  the  modified  DBC  command  Is 

RETRIEVE: (EMPNO) 

( (RELATION- ' EMP ' ) & (DNO- ’ D200 ' ) & (COMM>5000) 

& (CLUSTER>6) & (CLUSTERSIO) ) 


and  SAL  attributes  of  the  EMP  relation 


-34- 


I 


5.3  Translating  the  Data  Manipulation  Statements 

The  data  manipulation  facilities  of  SEQUEL  allow  the  user  to  directly 
change  values  In  the  database.  Using  these  facilities,  the  user  can  Insert, 
delete  or  update  a tuple  or  a group  of  tuples  In  the  database.  He  can  also 
assign  the  result  of  a query  to  a newly-created  relation. 

A.  Insertion 


Any  time  after  a relation  has  been  defined,  a new  tuple  may  be  Inserted 
Into  the  database  by  an  INSERT  statement.  As  an  example,  the  following  SEQUEL 
statement  will  Insert  a new  employee  named  'Jones'  with  employee  number  535 
In  Dept.  51,  having  other  attributes  null: 

INSERT  INTO  EMP(EMPNO,NAME,DNO) : 

<535,' JONES', 51> 

To  Insert  such  a tuple,  the  RDBI  first  determines  If  the  relation  has  a 
clustering  link  or  a clustering  Image.  Accordingly,  It  then  creates  a 
keyword  <CLUSTER, cluster  number>  and  then  sends  the  DBG  command 

INSERT:  (<RELATION='EMP’>,<EMPNO=535>,<NAME=' JONES '>, 
<DN0=51>,<CLUSTER=cluster  number>) 
with  primary  clustering  condition  (RELATION^'EMP') 
and  secondary  clustering  condition 
(CLUSTER=cluster  number) 

In  case  the  secondary  clustering  condition  cannot  be  determined  either  due 
to  the  absence  of  a clustering  Image  or  clustering  link,  or  due  to  the  absence 
of  clustering  attributes  In  the  tuple  to  be  Inserted,  then  only  the  primary 
clustering  condition  Is  sent  as  part  of  the  DBG  command. 

Set-oriented  Insertions  In  SEQUEL  consist  of  the  evaluation  of  a query 
and  insertion  of  the  resulting  tuples  Into  some  existing  relation.  Assume, 
for  example,  that  the  database  contains  a relation  called  GANDIDATES  which 
has  columns  for  employee  number,  name,  department  number  and  salary.  The 
following  SEQUEL  statement  will  then  add  to  the  GANDIDATES  table  all  those 
employees  whose  commission  Is  greater  than  half  their  salary: 

INSERT  INTO  GANDIDATES: 

SELEGT  EMPNO, NAME, DNO, SAL 

FROM  EMP 

WHERE  GOMM>0.5*SAL 


The  RDBI  executes  this  statement  In  four  steps: 

(1)  The  query  Is  evaluated  by  using  the  DBG  RETRIEVE  command  .(Part 
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of  the  query  Is,  of  course,  evaluated  in  software  since  the 
arithmetic  operation  0.5*SAL  cannot  be  done  by  the  DBC  hardware.) 

(2)  For  each  tuple  retrieved  In  Step  1,  the  kejrword  <CLUSTER, cluster 
numbei>  Is  generated.  If  the  definition  of  the  CANDIDATES  relation 
allows  It. 

(3)  For  each  tuple,  a DBC  record  Is  formed  as  shown  In  Section  4. 

An  INSERT  command  Is  now  sent  to  the  DBC  for  each  DBC  record 
created  In  Step  3.  The  primary  and  secondary  clustering 
conditions  are  sent  together  with  every  record. 

B.  Deletion 

Deletion,  In  SEQUEL,  Is  done  by  means  of  a DELETE  statement  accompanied 
by  a WHERE  clause.  The  WHERE  clause  specifies  the  conditions  that  must  be 
satisfied  by  the  records  to  be  deleted.  A simple  example  of  the  DELETE 
operation  Is  the  deletion  from  the  EMP  relation  the  employee  with  employee 
number  561.  The  SEQUEL  statement  to  achieve  this  Is: 

DELETE  EMP 
WHERE  EMPNO-561 

The  corresponding  DBC  command  Is: 

DELETE : (RELATION- ' EMP ' ) & (EMPNO-561 ) 

Once  again,  whenever  applicable,  the  cluster  number  to  which  the  record (s) 
belongs  may  be  specified  by  means  of  another  predicate. 

A more  complex  example  that  uses  labels  Is  the  deletion  from  the  DEPT 
relation  all  the  departments  having  no  employees.  A SEQUEL  statement  to  do 
the  job  Is  given  below: 

DELETE  DEPT  X 
WHERE 

(SELECT  COUNT(*) 

FROM  EMP 

WHERE  DNO-X.DNO)«0 

The  RDBI  achieves  the  same  effect  with  the  following  procedure: 

(1)  Send  a DBC  command 

RETRIEVE : (DNO) (RELATION- ' DEPT ' ) 

For  every  department  number  'dl'  retrieved  by  (1),  do  the 
following  steps: 

(2)  Send  a DBC  command 

RETRIEVE : [ COUNT  ONLY ] 


( (RELATION- ’ EMP ')& (DNO- ' dl ') ) 
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(3) 


If  the  result  of  (2)  Is  zero,  then  send  a connnand 
DELETE : ( (RELATION* ' DEPT ' ) & (DNO- ' dl ’ ) ) 


C.  Update 


Updating  a tuple  In  SEQUEL  Is  done  by  an  UPDATE  statement  with  a SET 
clause  specifying  the  updates  to  be  made  on  the  selected  tuples.  The  RDBI 
can  translate  the  UPDATE  statements  In  one  of  two  ways:  1)  By  using  the 
RETRIEVE  command  to  determine  the  MAU  addresses  of  the  selected  records, 
and  then  using  the  REPLACE  command  to  modify  these  records  one  at  a time, 
or  2)  By  using  the  REPLACE  command  to  modify  all  the  selected  records 
simultaneously.  Which  of  these  two  methods  Is  to  be  used  depends  on  the 
actual  SEQUEL  statement.  If  the  SET  clause  makes  Identical  changes  to  all 
the  selected  tuples,  only  then  can  the  second  method  be  used.  We  Illustrate 
the  two  cases  with  two  examples. 


Update  Example  1:  Update  the  EMP  table  by  giving  a 10%  raise  to  all  those 

employees  who  are  In  the  CANDIDATES  relation. 

SEQUEL: 

UPDATE  EMP 

SET  SAL-SAL*!, 1 

WHERE  EMPNO  IN 

SELECT  EMPNO 
FROM  CANDIDATES 

DBC  Commands: 

a.  RETRIEVE : (EMPNO) (RELATION* ' CANDIDATES ' ) 

For  each  employee  number  'el*  retrieved  by  (a)  send  a command 

b.  RETRIEVE: [WITH  POINTER]  (SAL) 

( (RELATION- ' EMP ' ) & (EMPNO- ' el ' ) ) 

Finally,  for  each  pointer  'pi'  and  salary  'sj'  retrieved  by  (b), 
coDq>ute  a new  salary  sk-(l.l*sj)  and  send  a DBC  command 

c.  REPLACE : (pi) (<SAL  - sk>) 

This  command  specifies  that  the  record  pointed  to  by  'pi'  Is  to  get 
the  modified  value  'sk'  for  the  SAL  attribute. 


Update  Example  2:  Update  the  EMP  relation  by  giving  a commission  of  $5000 

to  every  clerk. 

SEQUEL: 

UPDATE  EMP 

SET  COM^^5000 

WHERE  JOB- 'CLERK" 

DBC  Command: 

REPLACE : ( (RELATION- ' EMP ' ) & (JOB- ' CLERK' ) ) 

(<COMM,5000>) 

This  command  specifies  that  all  the  records  satisfying  the  query  conjunct 
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must  have  the  value  of  their  COMM  attribute  modified  to  5000. 

D.  Assignment 

An  assignment  statement  In  SEQUEL  allows  the  result  of  a query  to  be 
copied  Into  a newly-created  relation  In  the  database.  Thus,  the  execution 
of  an  assignment  statement  by  the  RDBI  Is  done  In  two  parts: 

(1)  The  records  satisfying  the  query  are  retrieved  as  shown  In 
Section  5.1,  and 

(2)  A new  relation  Is  created  with  the  records  retrieved  In  (1) . 

These  records  are  then  stored  In  the  database. 
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6.  RELATIONAL  DATA  CONTROL  FACILITIES 

System  R,  our  model  for  a relational  database  management  system,  has 
extensive  data  control  facilities  that  enable  users  to  control  access  to 
their  data  by  other  users,  and  to  exercise  control  over  the  Integrity  of 
data  values.  The  data  control  facilities  have  four  aspects:  transactions, 
authorization.  Integrity  assertions  and  triggers. 

A transaction  Is  a series  of  statements  which  the  user  wishes  to  be 
processed  as  an  atomic  act.  The  user  controls  transactions  by  the  operators 
BEGIN-TRANS  and  END-TRANS.  The  user  may  specify  save  points  within  a 
transaction  by  the  operator  SAVE.  As  long  as  a transaction  Is  active,  the 
user  may  back  up  to  the  beginning  of  the  transaction  or  to  any  Internal  save 
point  by  the  operator  RESTORE,  Implementation  of  transactions  on  the  DBC 
Involves  no  new  concepts.  From  the  save  point  onwards,  any  update  made  by 
the  transaction  will  cause  the  old  version  of  the  updated  record  to  be 
stored  In  the  DBC  mass  memory  as  part  of  a temporary  database.  A subsequent 
RESTORE  command  may  be  executed  by  simply  deleting  from  the  database  all 
updated  records  and  replacing  them  by  their  old  copies  stored  away  In  the 
temporary  database. 

6.1  Views 


System  R relies  on  Its  view  mechanism  for  authorization.  As  opposed 
to  the  base  relations  which  are  physically  stored  In  the  database,  a view 
Is  a virtual  relation  which  Is  a dynamic 'window'  on  the  database.  In 
response  to  a query,  the  tuples  of  a view  are  dynamically  computed  from  the 
base  relation (s).  An  update  to  a view  Is  not  allowed  If  It  Is  defined  on 
more  than  one  base  relation.  We  shall  briefly  describe  below  how  a view 
Is  defined  In  SEQUEL  and  how  It  Is  Implemented  In  the  DBC. 

Any  SEQUEL  query  which  results  In  a relation  may  be  used  to  define 
a view.  The  RDBI  translates  any  SEQUEL  query  by  first  translating  It 
Into  DBC  commands  In  the  normal  way  (as  discussed  In  Section  5) . Any 
reference  to  a view.  In  the  first  step.  Is  treated  as  a base  relation, 
but  the  DBC  commands  thus  formed  are  only  Intermediate  commands  not  to 
be  lomiedlately  transmitted  to  the  DBC.  These  commands  are  now  qualified 
with  predicates  determined  from  the  view  definition.  The  modified  commands, 
then,  are  the  ones  that  are  transferred  to  the  DBC  for  execution.  We  shall 
Illustrate  this  process  by  means  of  the  three  most  Important  cases  of  view 
definition. 
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A view  may  be  a row  and  column  subset  of  a base  relation.  For  exaiq>le, 
the  following  SEQUEL  statement  may  be  used  to  define  a view  called  D50 
containing  the  employee  number,  name  and  job  of  the  employees  who  work  In 
Dept.  50. 


DEFINE  VIEW  D50  AS: 

SELECT  EMPNO, NAME, JOB 
FROM  EMP 
WHERE  DNO-50 

A user  may  now  wish  to  find  the  names  of  all  clerks  In  D50  by  Issuing  the 
SEQUEL  statement 

SELECT  NAME 
FROM  D50 
WHERE  JOB- 'CLERK' 

This  statement  Is  first  translated  Into  an  Intermediate  DBC  command 
RETRIEVE : (NAME) ( (RELATION- 'D50* ) & (JOB- 'CLERK' ) ) 

However,  there  Is  no  stored  relation  called  D50.  Hence,  the  view  definition 
for  D50  Is  used  to  replace  the  predicate  (RELATION- ' D50 ' ) by  ((RELATION- 'EMP ')& 
(DNO- ' D50 ' ) ) , We  thereby  come  up  with  the  modified  DBC  command 
RETRIEVE: (NAME) 

( (RELATION- ' EMP ' ) & (DNO-50) & (JOB- ' CLERK' ) ) 

This  Is  the  command  that  Is  finally  transmitted  to  the  DBC,  after  making 
sure  that  no  reference  Is  made  to  any  field  (In  this  case,  NAME)  that  Is 
not  Included  In  the  view  definition  (In  this  case,  NAME  does  occur  In  the 
definition  of  D50) . 

A view  may  also  be  the  join  of  the  Information  In  two  or  more  relations. 

A view  called  DS,  for  example,  may  be  defined  as  follows  as  a join  of  the 
PART  attribute  of  the  relations  USAGE  and  SUPPLY: 

DEFINE  VIEW  DS  AS: 

SELECT  DNO, SUPPLIER 
FROM  USAGE, SUPPLY 
WHERE  USAGE. PART- SUPPLY. PART 

To  find  the  department  numbers  of  the  departments  to  which  the  supplier 
'Jones'  supplies  any  part,  one  may  Issue  the  following  SEQUEL  statement: 

SELECT  DNO 
FROM  DS 

WHERE  SUPPLIER- 'JONES' 

The  first  step  In  the  translation  of  this  statement  Is  the  creation  of  the 
DBC  command 

RETRIEVE : (DNO) ( (RELATION- 'DS' )& (SUPPLIER- 'JONES ' ) ) 


since  DS  Is  only  a view,  the  view  definition  Is  now  used  to  come  up  with  the 
DBG  commands: 

a . RETRIEVE : (PART) ( (RELATION- ' SUPPLY ' ) & (SUPPLIER- ' JONES ' ) ) 
and  for  each  part  'pi'  retrieved  by  (a),  another  conmand 

b . RETRIEVE : (DNO) ( (RELATION- ' USAGE ' ) & (PART- 'pi')) 

A view  may  be  a summary  of  the  Information  In  a base  relation.  For 
example,  to  define  a view  consisting  of  the  average  salary  of  each  department, 
one  can  Issue  the  SEQUEL  statement: 

DEFINE  VIEW  AVGSAL  AS: 

SELECT  AVG(SAL) 

FROM  EMP 
GROUP  BY  DNO 

Any  reference  to  the  average  salary  of  some  department  'dl'  In  AVGSAL  may 
then  be  translated  to  the  DBG  command 

RETRIEVE : (AVG (SAL) ) ( (RELATION- ' EMP ' ) & (DNO- 'dl ' ) ) 

Although  views  are  used  for  facilitating  the  description  of  a relational 
query,  a more  Important  use  of  the  views  Is  In  the  process  of  authorization. 

The  creator  of  a base  relation  may  grant  (and  may  later  revoke)  any  privilege 
(such  as  READ,  INSERT,  DELETE,  UPDATE)  to  (from)  other  users.  He  may  further 
provide  any  of  these  users  with  the  GRANT  option.  In  that  case,  the  latter 
user  has  the  privilege  to  grant  (revoke)  all  his  privileges  on  the  given 
relation  to  (from)  yet  another  user.  If  a user  Is  authorized  to  create 
a view  on  a base  relation  then  he  has  the  sole  authority  to  perform  any 
action  on  It  consistent  with  his  privileges  on  the  base  relation.  He  may 
also  grant  those  privileges  to  another  user. 

6.2  Authorization  and  Security 

System  R allows  for  an  extremely  simple  method  of  authorization  checking 
[15].  System  R maintains  two  tables  for  the  use  of  the  authorization  subsystem, 
namely,  SYSAUTH  and  SYSCOLAUTH.  The  SYSAUTH  table  has  upto  two  rows  for  each 
combination  of  relation  (base  relation  or  view)  and  user.  The  columns  In  the 
SYSAUTH  table  correspond  to  user  Id,  base  relation  or  view  name,  type  (whether 
base  relation  or  view) , a column  for  each  of  the  privileges  on  the  relation 
(the  entry  being  'Y'  or  'N'  In  every  such  column)  and  a column  for  grant 
option  ('Y'  or  'N').  For  each  relation  on  which  a user  Is  authorized  to 
perform  some  action,  there  are  upto  two  tuples  In  SYSAUTH:  one  for  grantable 
and  the  other  for  non-grantable  privileges. 
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In  case  the  user  has  update  rights  on  a relation,  the  table  SYSCOLAUTH 
Indicates  precisely  those  columns  of  the  relation  on  which  the  user  has  the 
update  privilege.  These  two  tables  SYSAUTH  and  SYSCOLAUTH  are  updated 
whenever  a new  base  relation  or  view  Is  created  or  whenever  an  authorized 
user  executes  a GRANT  statement  thereby  granting  a set  of  privileges  to  one 
or  more  other  users.  The  two  tables  are  referenced  Immediately  before  the 
execution  of  any  SEQUEL  statement.  Since  all  SEQUEL  statements  refer  to 
an  action  and  to  one  or  more  relations,  authorization  checking  can  be 
accomplished  by  the  RDBI  even  before  the  statements  are  translated  to  DBG 
commands.  The  checking  process  Is  exactly  Identical  to  that  In  System  R. 

As  a possibility,  however,  an  extra  dimension  can  be  added  to  the 
authorization  mechanism  of  System  R by  using  the  hardware  security 


enforcement  feature  of  the  DBC.  Although  the  view  mechanism  of  System  R 


Is  Important  for  Its  use  In  specifying  authorization,  a view  need  not 
necessarily  be  used  for  that  purpose  alone.  A view  may  be  defined,  for 
example,  for  the  sole  purpose  of  simplifying  SEQUEL  queries.  Therefore, 


It  may  not  be  practical  to  represent  every  view  of  a base  relation  In  the 
security  module  of  the  DBC.  Furthermore,  this  Is  not  necessary  since 
security  enforcement  by  views  and  relations  can  be  easily  done  even  before 
the  SEQUEL  statements  are  translated  to  DBC  commands. 

Suppose,  now,  that  many  different  row-subsets  of  a relation  are  to  be 
authorized  for  access  by  various  users.  Then  a view  must  be  defined  for 
every  such  subset,  and  each  user  must  be  aware  of  the  names  of  the  views 
corresponding  to  the  row-subsets  he  Is  allowed  to  access.  A better  scheme 
would  be  to  somehow  allow  each  user  access  to  appropriate  logical  subsets 
of  the  base  relation  without  requiring  the  user  to  remember  the  names  of 
these  subsets.  In  any  SEQUEL  statement,  the  user  can  then  make  a reference 
directly  to  the  base  relation.  If  the  statement  requires  the  retrieval 
(or  update)  of  any  part  of  the  relation  on  which  he  has  no  authorization 
for  the  corresponding  action,  then  the  statement  will  not  be  executed. 

Such  a scheme  Is  Ideally  Implemented  on  the  DBC.  The  values  of  any 


attribute  to  be  used  for  authorization  may  be  partitioned.  Access 
privileges  of  any  user  may  be  specified  In  terms  of  predicates  using  the 
above  attributes.  The  specifications  of  the  access  privileges  are  stored 


In  the  DBC  structure  memory.  Users  may  then  write  SEQUEL  statements  that 
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authorlzatlon  specifications  of  the  user  determined  automatically  by  the 
DBC  allow  such  operations. 

Here  Is  a simplified  exposition  of  how  the  DBC  security  mechanism  works. 

The  creator  of  a file  may  specify  certain  attributes  as  security  attributes. 

The  file  creator  also  specifies  security  descriptors  that  partition  the  values 
of  the  security  attributes. 

The  capabilities  of  a user  are  specified  In  the  form  of  file  sanctions, 
a file  sanction  being  a pair  <predlcate  conjunct,  access  prlvllege>.  Each 
predicate  In  a file  conjunct  Is  restricted  such  that  It  must  refer  to 
complete  blocks  of  the  partition  made  by  the  corresponding  security  descriptors 
(l.e.,  the  descriptors  whose  attribute  parts,  agree  with  the  attribute  part 
of  the  predicate).  In  addition  to  the  file  sanctions,  the  user  capability 
also  consists  of  certain  default  access  privileges  applicable  to  any  record 
that  does  not  satisfy  the  predicate  conjunct  In  any  file  sanction. 

The  DBC  will  then  create  an  atomic  access  privilege  list  (AAPL)  for 
the  given  user.  An  atom  Is  a set  of  records  consisting  of  keywords  that 
satisfy  the  exactly  same  security  descriptors . Using  the  user  capabilities 
and  the  security  descriptors  of  the  file,  the  DBC  creates  an  AAPL  for  each 
user.  An  AAPL  consists  of  entries  of  the  form  <atom  number,  access  prlvllege>. 
The  construction  of  an  AAPL  Is  done  only  once,  l.e.,  one  AAPL  Is  constructed 
for  each  <flle,user>  pair. 

During  the  actual  command  execution  time,  the  DBC  refers  to  the  AAPL 
corresponding  to  the  user  who  makes  the  command  and  the  file  to  which  the 
command  refers.  For  every  predicate  conjunct  within  the  command,  the  DBC 
determines  the  atoms  to  which  It  will  refer.  Using  the  AAPL,  the  DBC  now 
decides  whether  the  requested  access  Is  grantable  or  not.  Only  If  the  access 
Is  grantable  for  every  referred  atom  Is  the  command  finally  carried  out . 

As  an  example,  consider  that  there  Is  a file  with  two  security  attributes, 
JOB  and  SALARY.  The  security  descriptors  are  as  given  below; 

SDl.  0<SALARY<5000 
SD2.  5000<SALARY<10000 
SD3.  10000<SALARY530000 
SD4.  30000<SALARY<10000000 
SD5.  JOB- 'CLERK' 

SD6.  JOB-' ANALYST' 

SD7.  JOB- 'MANAGER' 


Further  assume  that  the  capability  of  a particular  user  Is  given  as : 


r^T 


sfn 
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File  Sanction 

Access  PrlvUege 

(1) 

(SALARY<10000) 

READ , UPDATE , DELETE 

(2) 

(SALARY>10000)&(JOB-’ANALYST' ) READ, UPDATE 

(3) 

Default 

READ 

The 

DBG  now  creates 

an  AAPL  for  the  given  user.  There  are  12  atoms 

and  they  have  the  following  access  privileges: 

• « 

(1) 

SD1,SD5 

READ , UPDATE , DELETE 

(2) 

SD1,SD6 

READ .UPDATE , DELETE 

« m 

(3) 

SD1,SD7 

READ .UPDATE .DELETE 

(4) 

SD2,SD5 

READ , UPDATE , DELETE 

• • 

(5) 

SD2,SD6 

READ , UPDATE , DELETE 

(6) 

SD2,SD7 

READ , UPDATE , DELETE 

* • 

(7) 

SD3,SD5 

READ 

(8) 

SD3,SD6 

READ .UPDATE 

(9) 

SD3,SD7 

READ 

• m 

(10) 

SD4,SD5 

READ 

- - 

(11) 

SD4,SD6 

READ, UPDATE 

m m 

(12) 

SD4,SD7 

READ 

- - 

A DBG  command  made  by  this  user  can  now  be  compared  against  his  AAPL 

• * 

to  determine  whether  the  required  access  Is  to  be  granted.  For  exaiiq}le.  If 

- • 

the  command  Is  to  update 

some  field (s)  of  any  record  belonging  to  an  analyst 

earning  more  than  12000,  then  the  atoms  referred  to  are  atom  8 (which  satisfies 
SD3  and  SD6)  and  atom  11  (which  satisfies  SD4  and  SD6) . The  update  privileges 
are  grem table  for  both  these  atoms.  Therefore,  the  command  Is  carried  out. 

On  the  other  hand.  If  the  command  Is  to  update  some  field (s)  of  any  clerk 
earning  between  6000  and  15000,  then  the  atoms  referred  to  are  atom  4 (which 
satisfies  SD2  and  SD5)  and  atom  7 (which  satisfies  SD3  and  SD5) . VAille  the 
update  privilege  Is  grantable  for  atom  4,  It  Is  not  grantable  for  atom  7. 
Therefore,  the  given  command  Is  not  carried  out,  l.e.,  the  requested  access 
Is  denied. 

6.3  Assertions  and  Trla;gers 

Another  Important  aspect  of  data  control  as  provided  In  System  R Is 
that  of  assertions  about  data  Integrity.  Any  SEQUEL  logical  expression 
(e.g.,  AVG(SAL)>20000)  associated  with  a base  relation  or  view  may  be  stated 
as  an  Integrity  assertion.  At  the  time  an  assertion  Is  made  (by  an  ASSERT 
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statement) , Its  truth  Is  checked;  If  true,  the  assertion  Is  enforced  until 
It  Is  explicitly  dropped  (by  a DROP  ASSERTION  statement) . As  examples  of 
the  Implementation  of  assertions  on  the  DBG,  consider  the  following: 

Example  1:  No  employee  should  have  a salary  greater  than  50000.  In  SEQUEL, 
this  assertion  Is  written  as: 

ASSERT  ON  EMP:  SAL  < 50000 

In  the  DBG,  this  assertion  Is  first  checked  to  be  true  by  means  of  the  command 
RETRIEVE: [GOUNT  ONLY] ((RELATION- 'EMP')&(SAL>50000)) 

If  the  count  Is  zero,  then  the  assertion  Is  true  to  start  with.  Thereafter, 
after  any  update  to  employee  records,  the  SAL  field  Is  checked  to  verify 
that  the  truth  of  the  assertion  still  holds. 


Example  2:  The  salary  of  an  employee  should  never  decrease.  This  Is  written 
In  SEQUEL  as: 

ASSERT  ON  UPDATE  TO  EMP:  NEW  SAL  > OLD  SAL 
This  assertion  requires  no  Initial  chekclng.  Only  during  an  update  on  the 
EMP  relation,  must  It  be  checked  If  the  SAL  field  Is  going  to  be  modified. 

In  such  a case,  an  extra  retrieval  query  may  be  required  by  the  DBG  In  order 
to  retrieve  the  records  that  are  going  to  be  affected  by  the  update;  the 
assertion  Is  checked  via  software,  and  then  the  update  Is  made  If  the  assertion 
holds. 

A final  data  control  aspect  of  System  R Is  the  concept  of  triggers,  which 
Is  a generalization  of  the  concept  of  assertions.  A trigger  causes  a 
prespecified  sequence  of  SEQUEL  statements  to  be  executed  whenever  some 
trlggerlr%  invent  occurs,  such  as  retrieval,  deletion.  Insertion  or  update  of 
a base  relitloii  or  view.  The  ROBI  can  monitor  such  events  by  simply  scanning 
a transaction  for  SEQUEL  statements  that  correspond  to  a particular  triggering 
event.  Immediately  after  each  of  these  statements,  a call  statement  Is 
Included  to  Invoke  the  appropriate  trigger  routine.  The  modified  transactions 
and  the  trigger  routines  are  now  translated,  as  usual.  Into  DBG  commands. 


L 
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7.  PERFORMANCE  ANALYSIS 

Because  of  the  parallelism  Involved  In  the  operations  performed  by  the 
DBG,  It  should  be  Intuitively  clear  that  user  transactions  will  run  faster 
on  the  DBC  than  on  a conventional  computer.  The  speed  Is  further  enhanced 
by  the  fact  that  a sequence  of  software  operations  can  be  replaced  completely 
by  a single  DBC  command.  For  example.  In  order  to  find  all  the  records 
satisfying  a conjunct  of  predicates,  a conventional  system  will  first  deter- 
mine (In  some  manner, e.g.,  via  an  Index)  the  eligible  records.  It  will  then 
retrieve  these  records  and  compare  each  of  them  against  the  given  predicates. 
In  the  DBC,  on  the  other  hand,  not  only  are  all  the  eligible  records  retrieved 
In  parallel,  but  It  Is  also  true  that  this  set  of  retrieved  records  Is  exactly 
the  required  response  set.  ilie  reason  Is  simply  that  records  are  compared 
against  the  given  predicates  simultaneously  with  their  retrieval,  thereby 
rendering  unnecessary  any  subsequent  software  refinement  of  the  retrieved  set. 

In  this  section,  we  shall  make  an  analytical  study  of  the  DBC  performance 
and  compare  It  against  that  of  a conventional  computer  where  a relational 
database  management  system  (In  particular.  System  R)  Is  being  supported.  The 
DBC  will  ftinctlon  In  conjunction  with  a front-end  conqputer.  The  support 
software,  which  we  have  been  calling  the  Relational  Data  Base  Interface (RDBl) 
Is  housed  In  the  front-end  computer.  The  RDBl  Interprets  the  data  management 
calls  of  user  programs  and  executes  them  with  the  aid  of  the  DBC.  We  shall 
call  the  environment  consisting  of  the  DBC  and  front-end  computer  as  the 
DBC  environment.  A conventional  system,  on  the  other  hand,  consists  only  of 
a general-purpose  computer  (GPC)  which  houses  the  database  management  system 
software  and  executes  user  transactions  by  reading  (writing) record  from  (to) 
conventional  secondary  storage  devices.  We  shall  call  such  an  environment 
a GPC  environment. 

Although  the  response  time  and  throughput  of  a system  Is  the  most  widely 
used  measure  of  performance  of  the  system,  the  cost  of  the  system  Is  surely 
a complementary  measure  since  a tradeoff  Is  Involved  between  the  cost  and 
speed.  We,  therefore,  will  start  with  estimating  the  coat  of  a system  by 
determining  the  storage  requirement.  The  cost  of  the  other  components  of  a 
system  will  not  be  considered  since  they  are  fixed  (e.g.,  the  DBC  Is  Itself  a 
fixed-cost  component  In  a DBC  environment).  We  will  then  go  on  to  measure 
response  time  In  terms  of  the  number  of  secondary  storage  accesses  and 
processor  time  required  to  complete  the  execution  of  user  transactions. 
Numerical  results  are  also  computed  for  typical  database  parameters. 


7.1  Mass  Storage  Requirement 


The  mass  memory  of  Che  DBC  stores  Che  database  records.  Correspondingly, 
Che  secondary  storage  of  a conventional  relational  system  stores  the  tuples. 
Here,  In  this  section,  we  shall  try  to  estimate  this  storage  requirement. 

The  following  definitions  will  be  used: 

n = relation  cardinality  (#  of  tuples  or  records  In  the  relation) ; 

d = degree  of  a relation  (//  of  fields); 

p = length  of  a pointer  field  (or  Tuple  Identifier,  TID) , In  number 
of  bytes; 

/ ■ //  of  links  defined  on  a relation; 

average  length,  in  bytes,  of  the  value  of  the  1-th  attribute  of 
a relation;  and 

aj^=  average  length.  In  bytes,  of  the  1-th  attribute  name  of  a relation. 

We  will  ignore  In  our  analysis  the  details  of  certain  inqjlementation 
features  since  they  are  non-standard  and  have  only  marginal  effect  on  the 
storage  requirement.  For  exaiiq>le,  the  loading  factor  of  the  physical  blocks 
may  be  maintained  at  a level  slightly  below  unity  In  order  to  allow  for 
database  growth.  Even  if  the  loading  factor  were  to  have  any  appreciable 
effect  on  the  storage  requirement  of  a given  system,  this  effect  will  be 
nullified  when  we  compare  two  different  systems  against  one  another. 


GPC  Environment 


In  a conventional  implementation  of  System  R,  every  physical  tuple 

consists  of  an  ordered  list  of  values  and  a pointer  (or  TID  field) for  every 

link  defined  on  the  relation  to  which  the  tuple  belongs.  Thus,  the  mass 

storage  requirement,  M , for  a given  relation  Is 

S 

Mg  = n(|vi  + P^). 

In  case,  v^^  = v,  for  every  1,  we  have 

M » n(vd  + p^) . 

6 


DBC  Environment 


For  each  relation  of  cardinality  n,  the  DBC  stores  n records.  A record 
Is  composed  of  d attribute-value  pairs  If  the  degree  of  the  relation  Is  d,  A 
record  also  contains  a special  keyword  with  attribute  RELATION  to  Identify  the 
relation  to  which  It  belongs.  In  addition,  assuming  that  a clustering  link 
or  a clustering  Image  has  been  defined  on  the  relation,  another  special  keyword 


with  attribute  CLUSTER  Is  also  Included  in  the  record.  Thus,  the  mass  storage 
requirement,  Mj,for  any  given  relation  Is 

where  the  two  special  keywords  are  numbered  (d+l)-th  and  (d+2)-th,  respectively. 

Since  the  DBC  assigns  a unique  fixed- length  code  to  each  attribute.  It 
follows  that  aji^  “ a,  for  every  1,  We,  therefore,  have 

Mj  •»  na(d+2)  + n Z v . . 

° d+2  ^ 

Further,  If  for  every  IfV^^  “ v,  we  have 

= n(d+2)(v+a). 

We  now  define  the  mass  storage  ratio  as  the  ratio  of  mass  storage 
requirements  In  the  two  different  environments,  namely,  the  GPC  environment 
and  the  DBC  environment.  Therefore, 

“ Mg/M^  = + p^)/(Z^^v^  + (d+2)a) 

If  V£  ” V for  every  1,  we  have 

Rj,  = (vd  + p^)/((d+2)  (v+a)) 

In  Figure  7.1,  we  have  tabulated  the  mass  storage  ratio  Rq^  for  p 4 
bytes,  a “ 2 bytes  and  various  values  of  v,l  and  d.  Since  the  number  of 
attributes  In  a file  Is  small,  a length  of  2 bytes  for  attributes  name  should 
be  sufficient.  The  average  length  of  the  value  part  of  an  attribute-value 
pair  is  varied  In  steps  of  2,  from  2 to  8.  Since  the  number  of  links  de'^lned 
on  a relation  Is  not  likely  to  exceed  the  number  of  attributes  (unless  an 
attribute  appears  In  a number  of  links,  each  connecting  two  relations),  we 
may  assume  that  in  a practical  database,  < £ d.  Thus,  we  notice  from  Figure  7.1 
that  R^  Is  ustially  less  than  unity.  Furthermore,  since  the  number  of  links 
defined  on  a relation  Is  usually  one  or  more,  the  value  of  is  likely  to  be 
greater  than  0.5.  That  Is,  In  a practical  database, 

0.5  S Rb,  £ 1.0 

We,  therefore,  conclude  that  the  mass  memory  requirement  In  a DBC  environment 
Is  somewhat  more  than  and  up  to  double  the  requirement  in  a GPC  environment. 

7.2  Directory  Storage  Requirement 


While  the  mass  memory  stores  the  database  files  containing  the  tuples 
or  records,  storage  Is  also  required  for  keeping  the  Indexes  (directories) 
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and  Che  database  definition.  The  database  definition  consists  of  the 
characteristics  of  every  relation  such  as  relation  name,  degree,  attribute 
names  and  types,  names  and  definition  of  links  and  Images,  and  definition  of 
triggers  and  assertions.  The  database  definition  constitutes  the  conceptual 
view  or  schema  of  the  database.  It  must  be  stored  by  any  system  that  Implements 
the  given  database.  Thus,  the  storage  requirement  for  the  schema  Is  Independent 
of  the  machine  on  which  the  database  management  system  Is  being  Implemented. 

We  shall, therefore,  make  no  further  attenqpt  to  estimate  the  memory  requirement 
for  the  schema. 

More  Important  Is  Che  amount  of  memory  occupied  by  the  Indexes.  The  size 
and  structure  of  the  Indexes  varies  from  one  realization  of  the  database  to 
another  depending  on  the  machine  which  supports  the  database.  This  Is 
particularly  true  when  one  machine  uses  conventional  location-addressed 
secondary  storage  and  the  other  Is  a database  machine  using  (partitioned) 
content-addressable  memory  and  having  Che  capabilities  for  hardware  maintenance 
of  directories.  We  shall  now  analyze  the  directory  storage  requirement  In  the 
two  different  cases. 

GPC  Environment 

In  System  R,  each  Image  Is  an  Index  to  a relation.  System  R's  Relational 
Storage  System  (RSS)  maintains  each  Image  through  Che  use  of  a multi-page  Index 
structure.  Each  Index  Is  a dense  Index  In  the  sense  that  every  value  of  the 
underlying  attribute  or  attribute  combination  Is  represented  In  the  Index, 
thereby  making  It  possible  to  determine  the  address  of  every  record  satisfying 
an  equality  predicate  based  on  the  above  attribute  or  attribute  combination. 

The  pages  for  a given  Index  are  organized  Into  a balanced  hierarchy  structure, 
called  B-trees  [16], 

A B-tree  of  order  s Is  a tree  which  satisfies  the  following  properties : 

(1)  Every  node  has  5 s sons. 

(2)  Every  node,  except  for  the  root  and  the  leaves  has  ^ m/2  sons. 

(3)  The  root  has  at  least  2 sons. 

(4)  All  leaves  appear  on  the  same  level,  and  carry  no  Information. 

(5)  All  non-leaf  nodes  with  k sons  contain  (k-1)  keys. 

In  Figure  7.2,  we  have  extracted  from  [16]  an  example  B-tree  of  order  7,  with 
Che  root  at  level  0 and  with  all  leaves  at  level  3. 

In  System  R,  the  B-trees  for  the  Images  slightly  differ  from  the  above 
definition.  Since  the  records  are  not  actually  stored  In  the  B-trees,  their 
addresses  (or  TIDs)  must  be  stored.  Thus  the  leaves  are  not  empty;  they  carry 
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a list  of  TIDs.  The  B-trees  In  System  R are  defined  as  follows.  Each  page 
Is  a node  within  the  tree  and  contains  an  ordered  sequence  of  Index  entries. 

For  each  non- leaf  node  , an  entry  consists  of  a <sort  value, polnter>  pair.  The 
pointer  addresses  another  page  In  the  same  structure  which  may  be  either  a 
leaf  page  or  another  non-leaf  page.  In  either  case,  the  target  page  contains 
entries  for  sort  values  less  than  or  equal  to  the  given  one.  For  the  leaf 
nodes,  an  entry  Is  a combination  of  sort  values  along  with  an  ascending  list 
of  TIDs  for  tuples  having  exactly  those  sort  values.  The  leaf  pages  are 
chained  In  a doubly-linked  list,  so  that  sequential  access  can  be  supported 
from  leaf  to  leaf.  The  structure  of  the  nodes  In  such  a B-tree  is  depicted 
In  Figure  7.3. 

To  compute  the  storage  requirement  per  Image  we  use  the  following 
nomenclature: 

n = relation  cardinality  (//  of  tuples  In  the  relation)  ; 
p “ length  of  an  Internal  pointer.  In  bytes  ; 
t “ length  of  a TID,  In  bytes  ; 

V * average  length  (In  bytes)  of  a value  of  the  attribute  on  which  the 
Image  Is  defined.  We  assume  that  the  Image  is  a single-attribute 
Image^slnce  this  Is  the  most  common  case  ; 
s = order  of  the  B-tree.  The  order,  which  determines  the  range  of  the 

number  of  pointer  (and  key)  fields  In  each  non-leaf  page  (Internal  node) , 
depends  on  page  size,  on  average  key  length  v,  and  on  the  length  p of 
an  Internal  pointer; 

1 <■  Image  cardinality,  which  Is  the  number  of  distinct  sort  field  values 
In  the  Image;  and 
bg  " page  size.  In  bytes. 

For  given  parameters  n,p,t,v,l  and  bg  we  shall  try  to  compute  the  minimum 
amount  of  memory  required  to  store  an  Image.  We  begin  by  computing  the  expected 
minimum  number  of  external  nodes  In  the  B-tree.  We  then  compute  the  order  s 
of  the  B-tree.  Next  we  compute  the  minimum  number  of  Internal  nodes,  thereby 
completing  the  analysis. 

Since  the  average  number  of  TIDs  per  key  Is  n/1,  we  may  expect 

(bg  - 2p)/(v  + (n/l)t) 

keys  per  external  node.  Hence,  the  minimum  number  of  external  nodes  E Is  given 
by 

E - fKv  + (n/l)t)/(bg  - 2p)l 
- fCiv  + nt)/(bg  - 2p)'| 


(i)  Structure  of  an  internal  node  in  the  B-tree 


! 
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(ii)  Structure  of  an  external  (leaf)  node  in  the  B-tree  (p^  is  a 

backward  pointer  to  the  preceding  leaf  page  and  p^  is  a forward 
pointer  to  the  next  leaf  page) 


Figure  7.3  Nodes  in  a B-tree 


The  order  s of  the  B-tree,  which  Is  the  maximum  number  of  pointer  fields 
In  each  Internal  node,  is  given  by 


8 - L(bg  - p)/(v  + p)J  + 1. 


In  order  to  compute  the  minimum  number  of  Internal  nodes,  I,  notice  that 
there  are  E nodes  In  level  u,  where  u Is  the  maxlmtan  level  of  the  tree;  there 
are  at  least  |E/s|  nodes  in  level  (u-1) , at  least  |e/s^|  nodes  in  level  (u-2),..., 
at  least  |e/s'‘|  nodes  In  level  0.  Since  there  Is  only  one  node  In  level  0,  it 
follows  that 

E/8“  < 1 < E/s“"^. 

Thus,  u - floggEl. 

The  minimum  number  I of  Internal  nodes  Is  now  given  by 

I - [e/sI  + [e/s^]  + ...  + fE/s'*! 

> E(8'»-1  - - s“) 

In  most  practical  situations  s is  large  and,  therefore,  even  If  u Is  small 
(say,  2 or  3),  s'^"^  » 1.  Hence, 

I a.  E/Cs^  - s). 

Finally,  the  minimum  directory  storage  requirement  per  Image,  Dg,  Is  given  by 
Dg  - (E  + I)  pages 
“ (E  + bytes 

■ E(1  + (l/(s2  - 8)))bg  bytes 

In  the  above  calculations  we  have  assumed  the  fact  that  every  Internal  node 
has  s pointers.  Due  to  updates  on  the  database.  It  Is  more  likely  that  there 
will  be  approximately  0.75s  pointers  per  Internal  node,  since  the  number.  In 
steady  state.  Is  likely  to  be  uniformly  distributed  between  0.5s  and  s.  The 
directory  storage  requirement  will  therefore  be  somewhat  greater  than  what 
the  above  calculations  Indicate. 

DBG  Environment 

Even  though  the  RDBI  maintains  no  directories  corresponding  to  the  Images 
and  links  defined  on  relations,  some  minimal  directories  are.  In  fact, 
maintained  In  the  structure  memory  of  the  DBG.  We  will  now  try  to  estimate 
the  size  of  such  directories. 

To  begin  with,  we  may  recall  that  there  are  directory  entries  for  only 
two  classes  of  keywords:  those  with  attribute  RELATION  and  those  with  attribute 
GLUSTER.  Since  these  keywords  are  also  defined  to  be  clustering  keywords,  the 
DBG  assigns  a unique  cluster  number  to  all  records  having  the  same  two  keywords 
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<RELATION,  r-naine>  and  <CLUSTER,  c-num>.  Thus,  a cluster  In  the  DBC  consists 
of  the  set  of  records  S such  that  two  records  R1  and  R2  are  In  S If  and  only  If 
<RELATION,  r-namel>,<CHJSTER,  c-nuinl>  € Rl, 

<RELAT10N,  r-name2>,  <CLUSTER,  c-nuiii2>  € R2, 
r-namel  ■ r-naine2 , and  c-numl  = c-num2  . 

A directory  entry  In  the  DBC  Is  of  the  following  form  (where  we  have  Ignored 
security  atom  numbers)  : 

<keyword,  (Indexl,  lndex2,....  Index  h)  > 
where  each  Index  Is  of  the  form  : 

(MAU#,  cluster^) 

We  use  the  following  nomenclature: 
a > length  of  a (coded)  attribute  name.  In  bytes^ 

V ■ average  length  In  bytes  of  the  (coded)  value  part  of  the  keywords 
with  attributes  RELATION  and  CLUSTER; 
c “ numbers  of  clusters  of  a relation  (usually  of  the  order  of  the 
number  of  MAUs  required  to  store  the  relation)  ; 
m ■ length  of  an  MAU#,  In  bytes; 
k * length  of  a cluster//.  In  bytes;  and 

J > average  number  of  MAUs  spanned  by  a cluster  (the  number  of  MAUs  spanned 
by  a cluster  Is  defined  to  be  the  number  of  MAUs  In  which  there  Is  at 
least  one  record  belonging  to  the  cluster) • 

The  number  of  different  Index  terms  (l.e. , (MAU#,cluster#)palrs)  for  a relation 
Is  simply  equal  to  cj.  Since,  for  any  given  relation,  there  Is  only  one 
directory  keyword  with  attribute  RELATION,  the  corresponding  directory  must 
have  all  the  Index  terms  for  the  relation.  On  the  other  hand,  there  are  up 
to  c directory  keywords  with  attribute  CLUSTER,  and  each  of  the  corresponding 
entries  has  an  average  of  j Index  terms.  Thus,  the  directory  memory  requirements 
for  a relation  Is  given  by 

Dd  ■ storage  for  the  entry  with  keyword  <RELATION,  - > 

+ storage  for  all  entries  with  keywords  of  the  form  <CLUSTER,  - > 

- ( (a  + v)  + cj  (m  + k) ) 

+ c((a  + v)  + J(m  + k)) 

■ (c  + 1)  (a  + v)  + 2cj  (m  + k)  . 

We  observe  that  the  directory  memory  requirement  per  relation,  Dj,  of 
the  DBC  Is  Independent  of  the  total  number  of  Images  defined  on  a relation. 

This  contrasts  with  the  fact  that  In  a GPC  environment  the  directory  memory 
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requlrement  per  relation  la  the  sum  total  of  the  storage  requirements  for 
all  Images  on  a relation.  If  there  are  L Images  on  a relation  and  each  Image 
requires  the  same  space  D^,  then  the  directory  memory  requirement  per  relation. 
In  the  GPC  environment.  Is  LDg,  We  define  the  directory  storage  ratio  Rj 
as  the  ratio  of  the  directory  memory  requirement  In  the  GPC  environment  to 
that  In  the  DBC  environment.  If  there  are  L Images  per  relation  and  every 
Image  Is  of  equal  size,  we  then  have 

R.  » 

d g d 

where  = (c+1) (a+v)  + 2cj (mfk) , 

Dg  = E(1+(1/(s2  - s)))bg, 

E = [(Iv  + nt)/(bg  - 2p)l, 
and  s « L(bg  “ p)/(v+p)J  + 1. 


In  the  computation  of  D^,  the  value  of  j,  which  Is  the  number  of  MAUs 
spanned  by  a cluster.  Is  a dependent  parameter.  It  depends  on  the  cluster 
size,  MAU  size,  loading  factor  of  the  database  and  the  storage  pattern.  We 
conducted  a niimber  of  simulation  experiments  to  estimate  the  value  of  j. 

For  a given  number  of  clusters  c.  It  was  assumed  that  there  Is  an  equal 
probability  of  any  record  belonging  to  any  given  cluster.  The  essential 
structure  of  each  experiment  Is  summarized  below  In  the  form  of  an  algorithm  : 


Step  0.  To  start  with,  all  MAUs  and  all  clusters  are  empty. 

Step  1.  If  enough  records  have  been  generated  so  that  the  loading 
factor  of  the  mass  memory  Is  1,  then  compute  statistics  and 
stop.  Else  go  to  2. 

Step  2.  Generate  a random  record  and  determine  Its  cluster  number, 

CN.  (It  Is  actually  enough  to  generate  only  a random  cluster 
number  CN,  since  all  records  are  assumed  to  be  of  equal  size). 

Step  3.  If  the  cluster  CN  was  previously  empty,  then  select  an  MAU 
which  Is  currently  being  occupied  by  the  least  number  of 
records.  Assign  that  MAU  to  the  cluster  CN,  store  the  record 
In  that  MAU  and  go  back  to  1. 

Step  4.  If  the  cluster  CN  Is  not  empty,  then  It  has  already  been 

assigned  one  or  more  MAUs,  all  of  which,  except  one.  Is  known 
to  be  full.  Select  from  these  MAUs,  the  only  one  MAU#,  M, 
which  possibly  Is  still  not  full. 

Step  5.  If  M is  not  full  then  store  the  record  In  that  MAU  and  go 
back  to  1.  Else  go  to  6. 

Step  6.  Select  an  MAU  which  Is  currently  being  occupied  by  the  least 
number  of  records.  Add  that  MAU  to  the  list  of  MAUs 
assigned  to  the  cluster  CN,  store  the  record  In  that  MAU 
and  go  back  to  1. 

A total  of  40  simulation  runs  were  made  In  all,  for  100,000  fixed-length 
records  In  each  case.  An  experiment  was  conducted  for  every  combination  of 
(1)  total  number  of  MAUs,  taken  from  the  set  {50,100,200,400,800}, 


(2)  ratio  of  total  number  of  clusters  to  the  total  number  of  MAUs, 


taken  from  the  set  {1,2, 4, 8}  and 
(3)  loading  factor,  taken  from  the  set  {0.9,0.95}. 

Notice  that,  since  the  number  of  MAUs,  the  number  of  records  and  the  loading 
factor  are  constant  for  any  given  experiment,  the  length  of  the  fixed-size 
records  for  that  experiment  Is  automatically  determined  (In  terms  of  the  size 
of  an  MAU) . For  example.  If  there  are  100,000  records,  100  MAUs  and  the 
loading  factor  Is  0.9,  then  the  ratio 

Record  Slze/MAU  Size  - 100  * 0.9/100000  - 0.0009 

We  observe  from  the  tables  In  Figure  7.4  that  even  at  a very  high  loading 
factor,  0.95,  the  value  of  j does  not  exceed  2.  In  practice,  even  when  database 
updates  are  taken  Into  account,  as  long  as  the  loading  factor  does  not  exceed 
0.95,  say,  we  do  not  expect  a cluster  to  span  more  than  two  MAUs,  on  the  average. 
Therefore,  we  anticipate  the  following  bounds  on  j 

1 5 j < 2. 

Coming  back  to  the  computation  of  the  directory  storage  ratio  R^,  we  assume  jj 

that  there  Is  only  one  Image  per  relation,  l.e.,  L^l.  Consider  the  following 
parameters: 

a >■  length  of  a coded  attribute  name  •>  2 bytes  ; 
j “ # of  MAUs  spanned  by  a cluster  ■ 2 ; 

V ■ length  of  the  value  of  an  attribute  = 4 bytes ; 
m length  of  an  MAU  # “ 4 bytes  ; 
k * length  of  a cluster  //  " 4 bytes  ; 
t ■ length  of  a TID  = 4 bytes  I 

p > length  of  an  Internal  pointer  In  a B-tree  » 4 bytes  ; 
bg"  page  size  (size  of  a node  In  the  B-tree)  = 4000  bytes  ; 

(This  Is,  perhaps,  larger  than  usual;  but,  then,  smaller  bg  will 
only  reduce  Dg,  thereby  reducing  the  storage  ratio)  : 
r - ratio  of  the  number  of  clusters  of  a relation  to  the  number  of 
MAUs  occupied  by  the  relation  = 5 ; 


bj-  MAU  size  = 500,000  bytes; 

n number  of  records  In  the  relation,  taken  from  the  set  {1000,2000, 
5000, 10000 , 20000 ,50000,100000); 

n/1  - ratio  of  relation  cardinality  to  Image  cardinality,  taken  from  the 
set  {1,2,5,10,20,50,100};  and 

q ■ length  of  a DBC  record.  In  bytes,  taken  from  the  set  {50,100,200,500, 


1000,2000}.  . 
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Uslng  the  fact  that  the  number  of  MAUs  required  for  a relation  la 
(nq/d  1 and  the  fact  that  c,  the  number  of  clusters  of  the  relation.  Is  r times 
the  above-mentioned  nuadier,  we  can  now  compute  the  directory  storage  ratio 

These  calculations  are  tabulated  In  Figure  7.5.  Observe  that^other  para- 
meters remaining  unchanged,  after  the  number  of  records,  n,  has  reached  a high 
enough  value,  further  Increase  In  n does  not  have  much  effect,  since  both  Dg 
and  tend  to  Increase  proportionately  with  n,  for  large  n.  Further  observe 
that  as  DBG  record  length  Increases,  fewer  and  fewer  records  are  accommodated 
In  an  MAU,  thereby  Increasing  the  number  of  Index  terms  and  hence  the  storage 
ratio  R^. 

We  notice  that  for  a reasonable  record  length  between  100  and  1000  bytes, 
the  DBG  directory  memory  requirement  lies  between  0.05%  and  10%  of  that  of 
a conventional  system.  Furthermore,  If  there  are  more  than  one  Image  per 
relation  (which  Is  often  the  case) , then  the  directory  memory  requirement 
In  a GPG  environment  Increases  proportionately  with  the  number  of  Images . The 
DBG  directory  memory  requirement.  In  contrast,  remains  steady. 

7.3  Query  Execution  Time 

Query  execution  time  Is  perhaps  the  single  most  Important  measure  of 
performance  of  a database  management  system*  Given  a SEQUEL  query  In  a 
conventional  GPG  environment,  the  system  first  uses  an  optimizer  to  determine 
a good  access  strategy  from  among  a large  number  of  possible  access  strategies. 
Ignoring  the  parsing  and  optimization  time,  the  execution  time  of  a query 
consists  mainly  of 

(1)  the  time  to  access  a number  of  Index  pages  and  search  their  contents 
In  order  to  determine  a list  of  eligible  TIDs, 

(2)  the  time  to  access  a number  of  data  pages  In  order  to  fetch  the 
eligible  tuples,  and 

(3)  the  GPU  time  to  determine  the  final  response  set  from  the  list  of 
eligible  tuples. 

For  a given  query,  a single  predicate  of  a predicate  conjunct  In  the  query 
may  be  used  for  determining  the  eligible  TIDs.  After  the  corresponding  tuples 
are  retrieved,  they  are  placed  In  the  final  response  set  oiily  If  they  satisfy 
all  the  other  predicates  In  the  predicate  conjunct. 

In  a DBG  environment,  the  execution  time  of  a query  consists  mainly  of 

(1)  hardware  search  time  of  the  structure  memory  to  determine  the 
eligible  MAUs,  and 
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Figure  7.5  (continued)  Directory  storage  ratio  R^j 


-61- 


(2)  the  time  to  search  each  eligible  MAU  for  records  satisfying  a 
predicate  conjunct. 

To  get  a handle  at  the  analysis,  we  make  the  following  practicable 
assumptions: 

(1)  For  every  MAU  accessed  by  the  DBG,  we  allow  for  an  extra  processing 
time  in  the  structure  memory  (In  order  to  determine  the  Index  terms 
and  thus  the  MAU  numbers).  Therefore,  a constant  factor  K > 1 will 
be  used  to  multiply  the  number  of  MAU  accesses,  thereby  accounting 
for  query  processing  time  In  the  structure  emmory. 

(2)  Binary  search  of  the  Index  pages.  In  a GFC  environment,  takes  a 
negligible  amount  of  time  compared  to  the  time  to  access  each  page. 

(3)  The  time  to  access  an  Index  page,  the  time  to  access  a data  page 
and  the  time  to  access  an  MAU  are  all  equal  (equal  to  the  latency 
time  plus  rotation  time  needed  to  access  a disk  cylinder) . 

In  the  ensuing  discussion,  we  consider  the  two  most  lnq>ortant  types  of 
queries:  single-relation  queries  and  two-relation  join  queries.  The  analysis 
Is  in  the  style  of  [5].  The  time  to  execute  a query  Is  determined  in  terms  of 
the  number  of  accesses  to  the  physical  blocks. 

7.3.1.  Single -Relation  Queries 

A single-relation  query  is  exemplified  by  the  following  SEQUEL  quftry 
which  lists  the  names  and  salaries  of  programmers  who  earn  more  than  $10,000: 

SELECT  NAME, SAL 

FROM  EMP 

WHERE  J0B= ’ PROGRAMMER ' 

AND  SAIolOOOO 

This  Is  an  example  of  a single -relation  query  with  a single  predicate  conjunct. 

In  the  general  case,  there  can  be  a disjunction  of  X predicate  conjuncts,  but 
then  the  query  may  be  treated  as  X queries  each  with  a single  predicate  conjunct. 
We,  therefore,  only  restrict  ourselves  to  queries  with  a single  predicate 
conjunct.  Furthermore,  the  predicates  are  assumed  to  be  simple  predicates 
(l.e.,  the  predicates  are  simple  comparisons  of  a field  with  a value)  so  that 
they  can  be  matched  with  an  Image.  More  coaqillcated  predicates,  such  as 
EMP  X«>K}R  - EMP  Y.EMPN0,  cannot  be  matched  by  an  image.  Finally,  since  the 
consideration  of  links  Involves  a straightforward  extension  of  the  analysis 
given  below,  we  will  only  consider  Images. 

The  following  notations  are  Introduced  to  simplify  the  ensuing  discussion: 
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n = relation  cardinality; 

p = number  of  predicates  in  the  query; 

h = coefficient  of  CPU  time  (1/h  is  the  number  of  tuple  comparisons 
which  are  considered  equivalent  in  cost  to  one  page  access) ; 

i = image  cardinality; 

K = coefficient  of  DEC  structure  memory  processing  time,  the  time 
required  to  determine  index  terms  (K  > 1) ; 

f = # of  index  page  accesses  per  index  search  in  the  GPC  environment  (for 
a given  storage  device  and  given  key  length,  it  is  a function  of  the 
relation  cardinality  n and  the  image  cardinality  i,  but  normally 
has  a value  lying  between  2 and  4) ; 

B = average  number  of  tuples  (of  a relation)  per  data  page  (subscript  g 

O 

refers  to  the  GPC  environment) ; 

= average  number  of  records  per  MAU  (subscript  d refers  to  the  DEC 
environment);  and 

j = average  number  of  MAUs  spanned  by  a cluster  in  the  DEC. 

The  optimizer  in  System  R,  has  the  option  to  select  an  access  strategy 
among  a variety  of  choices.  The  most  important  of  these  are  listed  below. 

In  each  case,  the  execution-time  ratio  may  be  determined  by  computing  the 
ratio  of  the  time  Tg  required  to  execute  a query  in  the  GPC  environment  to 
the  time  T^  required  in  the  DEC  environment. 

Case  1,  A clustering  image  is  available  which  matches  a predicate  with  the 
comparison  operator  '=’.  Since  the  expected  number  of  tuples  that  satisfy 
the  predicate  is  n/i,  the  expected  number  of  data  pages  to  be  accessed  in 
the  GPC  environment  is  n/(iBg).  Since  each  of  the  retrieved  tuples  must  now 
be  compared  against  the  other  (p-1)  predicates,  the  total  time  required  in 
the  GPC  environment  is 

Tg  = n/(lBg)  + (p-l)hn/l  + f . 

Tg  may,  in  actuality,  be  somewhat  less  because  some  of  the  retrieved 
tuples  may  have  to  be  eliminated  from  further  consideration  even  before  all 
the  (p-1)  predicates  have  been  compared  with  them.  Furthermore,  since  the 
number  of  tuples  retrieved,  which  is  n/i,  is  expected  to  be  very  small,  we 
may  even  neglect  the  CPU  time  required  for  comparing  predicates.  Therefore, 

Tg  is  simplified  to 

Tg  = n/(iBg)  + f . 

In  the  DEC  environment,  whenever  the  equality  predicate  matches  a 
clustering  image,  only  one  cluster  need  be  searched.  Therefore, 


Td  - JK 

where  the  factor  K accounts  for  the  structure  memory  processing  time.  Finally, 
the  execution-time  ratio  Is 

\ " " n/(iBgjK)  + f/(jK) 

Case  2»  A clustering  Image  Is  available  which  matches  a predicate  whose 
coiq>arlson  operator  Is  not  Assuming  that  half  the  tuples  of  the  relation 

satisfy  the  predicate,  the  expected  times  are 

Tg  - n/(2Bg)  + (p-l)hn/2  + f 

and 

Tg  - nK(2Bj) 

Case  3.  A non-clustering  Image  Is  available  which  matches  a predicate  whose 
comparison  operator  Is  If  this  Image  Is  used  In  a GPC  environment,  then 

one  page  access  will  be  required  for  each  of  the  n/i  expected  tuples  that 
satisfy  the  predicate.  Without  the  advantage  of  secondary  clustering  Information 
(In  the  query),  the  DBC  has  to  access  the  entire  relation.  Therefore, 

Tg  ■ n/1  + (p-l)hn/l  + f 

and 

Td  - nK/Bd 

Case  4.  A non-clustering  Image  Is  available  which  matches  a predicate  whose 
comparison  operator  Is  not  If  this  Image  Is  used  In  the  GPC  environment, 

then 

Tg  - n/2  + (p-l)hn/2  + f 

and 

Td  » nK/Bd 

Case  5.  A clustering  Image  la  used  which  matches  no  predicate.  In  this  case, 
all  the  tuples  must  be  examined  In  the  GPC  environment.  Therefore, 

T = n/B  + phn  + f 

O O 

and 

Td  - nK/Bd 

Case  6.  A non-clustering  Image  Is  used  which  matches  no  predicates.  Since, 
we  may  Justifiably  assume  that  every  relation  has  a clustering  Image  (or 
cliisterlng  link) , this  choice  will  actually  never  have  to  be  made  In  a GPC 
environment.  In  any  case.  If  the  choice  were.  Indeed,  to  be  made,  then 
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and 


Tg  = n + phn  + f 


Td  = nK/Bd 

Case  7,  Suppose  there  are  Pg2l  equality  predicates  and  Pn-1  non-equality 
predicates  each  of  which  matches  an  Image,  then  the  (pg  + p^j)  Images  may  be 
searched  and  a TID  list  generated  for  each  predicate.  These  lists  may  be 
sorted  separately  and  then  Intersected  to  determine  the  final  TID  list  to  be 
searched.  We  then  get, 

Tg  = (n/(lPe2Pn)  + (p^  + p^)f 

where  we  have  neglected  the  predicate  comparison  time  since  the  final  list 
of  TIDs  will  be  very  small;  we  have  also  neglected  the  time  to  sort  the  TID 
lists,  which  may  be  appreciable  If  the  lists  are  actually  quite  long.  Notice 
that  when  p 22,  the  first  term  In  T„  Is  likely  to  be  quite  small  as  long  as 

C o 

the  Image  cardinalities  are  moderately  large.  In  such  a case,  we  may  write 
In  the  DBG  environment,  we  have 

Td  = JK,  If  an  equality  predicate  matches  a clustering  Image  and 
Td  = nK/Bd » otherwise. 

In  Figure  7.6,  we  have  tabulated  the  values  of  execution  time  ratio  R^.  for 
each  of  the  seven  cases  mentioned  above.  We  have  used  the  following  figures: 

K = 1.2 
f = 3 

p = 2 

h = 0.0001 
j = 2 

Bd/Bg  = 50 

Bg  Is  taken  from  the  set  {5,20,100,500} 
n/1  Is  taken  from  the  set  {1,2,5,10,50,100} 
n Is  taken  from  the  set  {1000,5000,20000,100000} 

(Pg  + Pjj)  Is  taken  from  the  set  {2,3}. 


The  assumption  of  Bj/B  = 50  requires  a little  explanation.  An  MAU  In 
the  DBC  Is  a disk  cylinder  which  normally  consists  of  20  to  40  tracks.  The 
track  size  to  page  size  ratio  In  a conventional  system  usually  varies  from 
1 to  5.  Finally,  the  size  of  a DBC  record  varies  from  1 to  2 times  the  size 
of  the  corresponding  tuple  of  a conventional  system.  Taking  these  factors 
into  consideration,  we  have  arrived  at  a reasonable  figure  of  50  for  the  ratio 
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Figure  7.6.  Execution  time  ratio  R^.  for  single-relation 
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Figure  7,6,  (continued)  Execution  time  ratio  R^ 


We  observe  a number  of  Important  facts  from  the  tables  in  Figure  7.6. 
Whenever  there  is  an  equality  predicate  matching  an  image  (e.g..  Case  1 and 
Case  3),  very  few  pages  need  to  be  searched  in  the  GPC  environment,  because 
of  the  choice  of  large  image  cardinalities.  Therefore,  in  these  cases  the 
term  f dominates  the  value  of  Tg.  In  Case  1,  the  DEC  has  to  search  only  one 
cluster  because  the  equality  predicate  matches  a clustering  image.  In  Case  3, 
however,  the  DEC  has  to  content-search  the  entire  relation.  So,  for  very 
large  relations  and  very  large  records,  the  GPC  environment  is  clearly  more 
favorable  in  Case  3,  Similar  reasoning  holds  for  Case  7 if  the  clustering 
image  does  not  match  even  one  of  the  equality  predicates.  In  all  other 
cases,  the  DEC  performs  one  or  more  orders  of  magnitude  better  than  a 
conventional  system.  In  short,  the  DEC  works  much  better  than  a conventional 
system,  whenever  any  one  of  the  following  holds: 

(1)  record  size  is  small,  say  50  to  200  bytes, 

(2)  relation  is  of  small  or  medium  size,  say  less  than  20,000  tuples, 

(3)  many  records  (say,  greater  than  50)  are  satisfied  by  an  equality 

predicate,  so  that  many  records  have  to  be  retrieved  by  either 

system, 

(4)  image  cardinality  is  medium,  say  n/i  > 100,  which  will  easily  be 
true  for  large  relations  (this  observation  actually  follows  from  3) , 
or 

(5)  a given  query  does  not  have  any  equality  predicate  that  matches 
an  image. 

The  GPC  environment,  in  contrast , works  out  as  good  or  better  than  the  DEC 
only  when  all  the  following  conditions  hold: 

(1)  the  relation  is  large,  say  greater  than  20,000  tuples, 

(2)  the  records  are  large,  say  500  bytes  or  larger, 

(3)  the  query  has  an  equality  predicate  that  matches  an  image,  and 

(4)  the  cardinality  of  the  above  Image  is  very  large,  say  n/i  £ 10. 

7.3.2  Queries  Involving  a Join  of  Two  Relations 

While  the  most  common  type  of  SEQUEL  query  is  the  single-relation  query 
(because  such  queries  appear  both  as  simple  statements  and  also  as  embedded 
queries  within  compound  SEQUEL  statements),  the  second  most  frequently  used 
query  is  possibly  the  one  involving  a join  of  two  relations.  An  example  of 
such  a query  is  as  follows.  It  lists  the  names,  salaries  and  department 
names  of  programmers  located  in  Evanston. 
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SELECT 

NAME,  SAL,  DNAME 

FROM 

EMP,DEPT 

WHERE 

EMP.JOB  - ’PROGRAMMER' 

AND 

DEPT.LOC  - 'EVANSTON' 

AND 

EMP.DNO  - DEPT.DNO 

The  most  general  form  of  a join  query  Involves  restriction,  projection  and 

join.  The  general  query  has  the  form: 

Apply  a given  restriction  (which  is  a single-relation  subquery  Ql) 
to  a relation  Rl,  yielding  a set  of  tuples  Rlj..  Apply  a possibly 
different  restriction  (which  Is  another  single-relation  subquery  Q2) 
to  a relation  R2,  yelldlng  R2j..  Join  Rlj.  and  R2j.  and  project  some 
fields  to  derive  the  final  response  set. 

The  optimizer  of  System  R determines  an  access  strategy  for  such  queries, 
based  on  the  characteristics  of  the  two  relation.  Four  possible  methods  are 
shown  in  [5]  but  the  full  details  of  the  optimizer,  as  well  as  the  justification 
of  the  various  methods,  have  not  yet  been  published.  Based  on  a study  of  the 
above  four  methods,  we  will  describe  a number  of  different  cases.  Once  again. 
Images  alone  will  be  considered.  Consideration  of  links  Involves  a straight- 
forward extension  of  the  techniques  described.  Because  of  the  enormous  number 
of  ways  a join  query  may  be  executed,  we  will  often  generalize  a number  of 
possibilities  by  describing  them  in  terms  of  the  individual  single-relation 
subqueries  Ql  and  Q2.  In  such  circumstances,  we  will  formulate  the  query 
execution  time  In  terms  of  the  execution  times  of  Ql  and  Q2. 

The  following  .nomenclature  is  used  In  the  analysis; 

®1»E2  ” average  number  of  external  nodes  In  a B-tree  of  relations 
Rl  and  R2,  respectively  ; 

^1»^2  * best  execution  times  for  separate  executions  of  the  subqueries 
Ql  and  Q2  In  the  GPC  environment  ; 

Di»D2  “ the  best  execution  times  for  separate  executions  of  the  subqueries 
Ql  and  Q2  in  the  DBC  environment  ; 

’^1*^2  “ cardinalities  of  relations  Rl  and  R2,  respectively  ; 

* average  Image  cardinality  of  images  on  attributes  of  Rl  and  R2, 
respectively  ; 

f * (/*  of  Index  page  accesses  for  searching  a given  key  In  the  GPC 
environment  ; 

j ~ average  number  of  MAUs  spanned  by  a cluster  In  the  DBC  ; 

K ■ coefficient  of  DBC  structure  memory  processing  time  ; 

Bg  ~ average  number  of  tuples  per  data  page  (In  GPC  environment)  ; 

Bjj  ■ average  number  of  records  per  MAU  (In  DBC  environment)  ; 
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Tg  « execution  time  of  a join  query  (in  GPC  environment); 

“ execution  time  of  a join  query  (in  DBG  environment);  and 

“ execution  time  ratio  Tg/T^  , 

Since,  the  number  of  secondary  storage  accesses  required  for  a join 
query  is  usually  quite  large,  we  will  Ignore  the  CPU  time  required  for 
comparing  tuples  against  predicates  in  the  GPC  environment.  With  this 
simplification  in  mind,  the  various  cases  are  discussed  below. 

Case  1.  If  clustering  Images  are  available  on  both  the  join  attributes,  if 

there  is  an  equality  predicate  in  each  of  Q1  and  Q2  and  if  there  is  an  image 

on  each  of  the  attributes,  A1  and  A2,that  match  these  equality  predicates, 
then  the  following  algorithm  (called  the  TID  algorithm)  may  be  used  in  the 
GPC  environment: 

Using  the  image  on  Al,  obtain  the  TIDs  of  tuples  from  R1  which  satisfy 
the  equality  predicate  of  Rl.  Sort  them  and  store  the  TIDs  in  a file 
Wl.  Do  the  same  with  R2,  using  the  image  on  A2  and  yielding  a TID 
file  W2.  Perform  a simultaneous  scan  over  the  Images  on  the  join 
attributes  of  Rl  and  R2,  finding  the  TID  pairs  of  tuples  which  match 
on  the  join  attributes.  Check  each  pair  (TIDl,  TID2)  to  see  if  TIDl 
is  present  in  Wl  and  TID2  in  W2.  If  they  are,  the  tuples  are  fetched, 
joined  and  projected. 

The  time  to  execute  this  algorithm  consists  of  the  time  to  search  the  images 
on  Al  and  A2,  the  time  to  scan  entire  clustering  Images  (on  the  join  attributes) 
and  the  time  to  retrieve  the  required  tuples  from  the  database.  Neglecting 
the  sorting  time,  the  approximate  expression  for  Tg  is 

Tg  = 2f  + (El  + E2  + 2f)  + (nj^/i^)  + 

= 4f  + + (n^/i^)  + (n2/i2)- 

I In  the  DBC  environment,  in  the  worst  case,  it  may  be  required  to  use  a 

CONNECT  ON  command,  thus  focing  a content-search  of  entire  relations  Rl  and 
R2.  Therefore,  in  the  worst  case 

Td  = K(n^  + 02)  / Bj. 

However,  if  n^/ij^  is  small,  then  it  may  be  faster  to  retrieve  the  records 
satisfying  the  subquery  Ql,  sort  these  records  by  the  join  attribute  and  send 
a separate  command  for  each  unique  value  of  the  join  attribute,  in  order  to 
retrieve  the  records  satisfying  Q2,  This  process  leads  to 

Tj  = Knj^/Bd  + Kj(ni/ij^). 

On  the  other  hand,  if  n2/i2  is  small,  then  we  may  start  by  retrieving  the 
records  satisfying  Q2  and  then  use  the  unique  values  of  the  join  attribute  of 
these  records  to  retrieve  the  records  satisfying  Ql.  We  then  have 

Td  - Kn2/Bd  + Kj(n2/i2). 


]: 
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In  Figure  7.7,  we  have  tabulated  ■ T /T,  for  the  worst-case  value  of 

t g d 

T^.  We  have  used  the  following  values:  f“3;  n*nj^"n2  taken  from  the  set 
{1000,  5000,  20000,  100000}  - n/500; 

set  {1,  5,  20,  100}  B taken  from  the  set  {250,  1000,  5000,  25000}  and  K=1.2. 

d 

We  observe  that  the  DBG  performs  one  or  more  orders  of  magnitude  better  except 

when  record  size  is  very  large,  say  greater  than  500  bytes  (corresponding  to 

B,  = 1000). 
a 

Case  2.  In  case  there  are  clustering  images  on  both  the  join  attributes  but  no 
other  predicate  matches  an  image,  then  the  following  algorithm  may  be  used  in 
the  GPC  environment: 

Perform  a simultaneous  scan  of  the  images  on  the  join  attributes  of  the 
two  relations.  Advance  the  R1  scan  (retrieving  the  tuples  of  R1  at  the 
same  time)  until  the  next  tuple  is  found  which  satisfies  Ql.  Let  the 
join  attribute  of  this  tuple  have  a value  V.  Advance  the  R2  scan  and 
fetch  all  tuples  of  R2  that  have  a value  V for  the  join  attribute  and 
satisfy  Q2.  Repeat  until  the  image  scans  are  completed.  (By  inter- 
changing R1  with  R2  and  Ql  with  Q2  in  the  statement  of  the  algorithm, 
we  may  get  yet  another  algorithm) . 

Assuming  that  there  is  an  equality  predicate  in  Ql,  whose  attribute  has  a 
cardinality  ij^,  the  time  required  to  execute  the  join  query  consists  of  the 
time  to  scan  completely  both  the  clustering  Images,  the  time  to  access  all 
tuples  of  R1  and  the  time  to  access  (nj^/ij^)  tuples  of  R2.  Therefore, 

Tg  = (E^  + f)  + (E^  + f)  + nj^/Bg  + nj^/ij^ 

The  worst  case  value  of  Tj,  on  the  other  hand,  is 

^d  “ + n2)/Bj 

If  is  small,  we  may  use  the  method  shown  in  Casel,  to  derive  an  actually 

better  performance. 

In  Figure  7.8,  we  have  tabulated  R,.  = T /T.  for  the  worst-case  value  of  T,. 

t g d d 

The  values  used  in  the  calculations  are:  f=3;  k=1.2;  n=n^“n2  taken  from  the 

set  {lOOO,  5000,  20000,  100000  };  E=E^=E2®n/500;  l=lj^=i2,  n/1  taken  from  the  set 

{1,  5,  20,  100}  B,/B  = 50;  B taken  from  the  set  (5,  20,  100,  500}.  We  notice 
d g g 

that  the  DBG  demonstrates  a performance  that  is  uniformly  better  than  that  of  a 
conventional  system. 


Case  3.  If  there  is  a clustering  image  on  the  join  attribute  of  Rl,  and  if  there 
is  no  clustering  image  on  the  join  attribute  of  R2,  then  a conventional  system  may 
first  retrieve  the  restriction  of  R2  (by  executing  the  single-relation  subquery  Q2 
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Figure  7.8.  Execution  time  ratio  for  a Join  query 
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In  the  best  possible  way),  sort  these  tuples  by  the  join  attribute  of  R2  and  then 
use  these  values  (of  the  join  attribute)  for  scanning  the  clustering  image  of  Rl. 
In  this  case,  4f  there  is  at  least  one  equality  predicate  in  Q2,  then 

'^8  “ ^2  ®1 

where  we  have  assumed  that,  because  of  the  clustering  image  of  Rl,  only  one 
access  is  required  for  each  unique  value  of  the  join  attribute. 

The  worst-case  value  of  is  still 

Td  = k(ni  + n2)/Bd. 

If  ^2^^2  small,  then,  in  the  DBG  environment,  it  may  be  faster  to  first 
retrieve  the  records  satisfying  Q2  in  the  best  possible  way,  and  then  use  the 
unique  values  of  the  join  attribute  of  these  records  to  fetch  the  records  of 
Rl.  In  that  case, 

= D2  + (n2/i2)jk. 

If  (n2/±2)  is  very  small  and  also  much  less  than  D2»  then  we  have 

Rt  (G2  + Ej^)/D2 

If  (,n2/i-2^  very  small  but  not  much  less  than  D2»  then  the  minimum  value  of 
R^  is 

Thus,  whenever  (n2/i2)is  small,  the  magnitude  of  R^.  is  at  least  of  the  order  of 
value  that  can  be  expected  for  the  single-relation  query  Q2o  For  large  values 
of  (02/12) » T is  at  least  of  the  same  order  as  T^,  We  do  not  tabulate  any 
values,  because  G2  and  D2  are  unknown  quantities  depending  on  the  execution 
time  of  single-relation  queries,  specific  cases  of  which  have  already  been 
treated  in  Section  7.3.1 

Case  4.  This  does  not  involve  any  novel  situation.  This  case  is  basically 
similar  to  Case  3,  and  it  can  be  derived  from  Case  3 by  interchanging  Rl  with 
R2  and  Q1  with  Q2.  The  analysis  is  also  similar. 


Case  5.  As  a final  possibility,  if  there  is  no  clustering  image  on  either  of 
the  join  attributes,  then  the  following  algorithm  may  be  used  in  the  GPC 
environment  as  well  as  in  the  DBC  environment: 

Using  the  best  possible  method,  retrieve  the  restriction  of  Rl  and  sort  the 
records  by  their  join  attributes.  Do  the  same  for  the  restriction  of  R2. 
Join  the  two  sets  of  records. 


In  the  DBG  environment,  however,  the  best  possible  way  is  the  usual  way  of 
expressing  the  commands.  For  example,  if  a predicate  in  Ql  matches  a clustering 
attribute,  then  the  DBG  automatically  uses  its  directory  to  guide  its  search. 
Furthermore,  this  entire  algorithm  may  be  expressed  by  a single  DBG  command 
using  the  GONNEGT  ON  clause. 

In  this  case,  we  have 

Tg  = + G2  and  Td  = + D2. 

Therefore,  the  execution  time  ratio  in  this  case  is  of  the  same  magnitude  as 
would  be  obtained  for  single-relation  queries. 

7.4  Summary 

In  this  section  we  have  compared  the  performance  of  two  different  systems 
that  Implement  the  same  relational  model  of  data.  We  have  expressed  the 
performance  ratio  in  terms  of  various  database  parameters  such  as  clock  length, 
record  size,  relation  cardinality,  image  cardinality,  etc.  Typical  values  of 
these  parameters  have  then  been  inserted  into  these  expressions  to  obtain  some 
concrete  numerical  results.  The  result  of  these  computations  is  as  follows: 

(1)  The  mass  memory  used  by  the  DBG  typically  varies  from  1 to  2 times 
that  used  by  a conventional  system, 

(2)  The  directory  memory  required  in  the  DBG  environment  is  t5T)ically 
one  or  two  orders  of  magnitude  less  than  that  required  in  the  GPG 
environment  (by  a conventional  system) . 

(3)  The  execution  time  required  for  usual  SEQUEL  queries  is  normally  one 
or  two  orders  of  magnitude  faster  when  the  DBG  is  used. 
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8.  CONCLUDING  REMARKS 

This  report  brings  to  conclusion  the  second  phase  of  a study,  the  goal 
of  which  is  to  design,  construct  and  evaluate  a hardware  architecture  that  can 
support  existing  database  management  activities.  After  initial  contemplation 
[17,18]  on  the  viability  of  such  a goal,  it  was  decided  that  a large-scale 
database  machine  may  not  actually  be  out  of  reach  either  from  the  cost  stand- 
point or  from  the  technological  standpoint.  The  most  important  database  activity 
is  decidedly  the  search  of  data  by  their  content.  To  store  a very  large  database 
on  a monolithic  associative  memory  is  an  enormously  costly  undertaking.  Furthermore, 
at  any  given  moment,  only  a small  fraction  of  the  database  (therefore,  the  memory) 
is  actually  in  use.  The  solution  envisioned  for  this  study  is  a partitioned 
content-addressable  memory  (PCAM)  where  each  partition  or  block  is  individually 
content-addressable  [9]  and  only  one  block  is  accessible  at  any  given  time.  This 
removed  one  of  the  cost  problems  facing  the  construction  of  database  machines. 

But  the  use  of  PCAM  for  mass  storage  gave  rise  to  another  problem.  In  order  to 
execute  a user  transaction  with  a reasonably  good  response  time,  it  should  be 
possible  to  efficiently  identify  the  blocks  relevant  to  the  transaction  and 
thereby  avoid  an  exhaustive  search  of  the  entire  database  store.  This  necessitated 
the  maintenance  of  directories  in  a faster  storage  which  is  at  least  one- 
hundredth  the  size  of  the  database  storage.  To  ensure  fast  response,  the 
directory  storage  (called  structure  memory)  should  also  be  a PCAM.  The  answer 
to  such  a storage  seemed  to  lie  in  one  of  the  emerging  memory  technologies, 
such  as  magnetic  bubble  memory,  electron  beam  addressable  memory  or  charge- 
coupled  devices  [8]. 

The  second  phase  of  the  study  was  aimed  at  demonstrating  that  the 
architectural  design  of  the  DBC  is  effective.  By  effective  we  mean  that  the  DBC 
ought  to  be  capable  of  supporting  the  most  important  data  models,  namely, 
hierarchical,  network  and  relational,  in  a manner  that  it  at  least  outperforms 
conventional  computers  and  systems.  That  the  DBC  clearly  outperforms  a 
conventional  hierarchical  or  network  database  management  system  has  been 
demonstrated  in  [1,2].  The  current  report  concludes  this  work  by  considering 
Che  relational  data  model  and  system. 

To  complete  the  perspective,  we  brlefxy  outline  the  future  works  envisioned 
’ •t>idv.  The  next  phase  consists  of  bringing  the  individual  components 
>•  . iwr*  to  the  logic-design  level.  The  design  and  testing  of  all 

• ) me  at  the  same  time.  Following  this,  we  plan  to  construct 
. aleij-down  prototype  of  the  DBC.  The  final  phase 


1 


consists  of  an  evaluation  of  the  DBC  by  supporting  on  It  an  existing  database. 


In  this  report,  as  has  been  Indicated  earlier,  we  have  tried  to  show  how 
well  a relational  database  management  system  can  be  supported  on  the  DBC. 

System  R has  been  considered  as  a typical  relational  system.  We  have  shown  how 
user  queries  written  In  the  data  sublanguage  called  SEQUEL,  are  represented  in 
the  form  of  DBC  commands.  Other  statements  in  SEQUEL,  for  example,  those  used 
for  updating  the  database  are  similarly  translated.  The  clustering  Information 
provided  In  the  schema  in  the  form  of  clustering  images  and  clustering  links  are 
appropriately  used  in  clustering  the  DBC  records.  It  has  also  been  shown  how 
the  view  mechanism.  Integrity  assertions,  triggers  and  authorization  can  all  be 
supported  on  the  DBC, 

Finally,  in  this  report,  we  have  made  a performance  analysis  in  which  we 
have  compared  the  storage  requirements  and  query  execution  times  of  a conventional 
relational  system  versus  a DBC-supported  relational  system.  It  has  been  observed 
that  while  the  mass  memory  requirement  in  the  DBC  is  usually  between  one  and 
two  times  the  requirement  in  a conventional  system,  there  is  a tremendous  saving 
in  directory  memory  and  very  large  reduction  in  the  execution  time  of  queries 
when  a DBC  is  being  used.  Specifically,  the  usual  directory  memory  requirement 
and  query  execution  times  are  likely  to  be  one  or  more  orders  of  magnitude  better 
than  those  of  a conventional  system.  The  reason  for  this  performance  enhancement 
lies  in  the  very  large  size  of  the  DBC  mass  memory  blocks,  content -addressability 
of  each  block  and  the  clustering  of  DBC  records  primarily  by  relation  names. 
Because  of  very  large  block  sizes,  directories  are  small  and  every  mass  memory 
access  allows  the  DBC  to  Inspect  a very  large  number  of  records . Because  of 
the  content-addressability  of  each  block,  the  response  set  of  a query  is  usually 
the  same  set  of  records  as  returned  by  the  DBC.  Therefore,  no  added  CPU  time 
Is  needed  to  compare  the  retrieved  records  against  the  predicates  that  form  the 
query.  Clustering  of  all  records  belonging  to  any  given  relation  ensures  that 
any  single-relation  query,  whatever  Its  composition,  will  require  at  most  as 
many  mass  memory  accesses  as  there  are  blocks  occupied  by  the  relation.  Further 
speed  gains,  which  do  not  show  up  In  the  analysis,  follow  due  to  the  various 
other  functional  features  of  the  DBC  such  as  hardware  sorting,  automatic  memory 
management,  and  hardware  to  compute  the  common  set-functions  such  as  average. 
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APPENDIX  A — NORMAL  FORMS  OF  RELATIONS 


We  shall  start  with  some  definitions.  For  a relation  R consider  two 
attributes  A and  B.  If,  at  every  instant  of  time,  for  every  value  of  A, 
there  exists  exactly  one  value  of  B in  the  tuples  of  R,  then  B is  said  to 
be  functionally  dependent  on  A,  written  A-^B.  This  definition  is  generalized 
in  the  obvious  way  to  functional  dependencies  involving  compound  attributes . 

An  attribute  set  (compound  attribute)  X of  a relation  R is  said  to  be 
a key  of  R if  every  attribute  of  R is  functionally  dependent  on  X and  if 
no  subset  of  X has  this  property.  An  attribute  that  appears  in  any  key 
of  R is  called  a prime  attribute  of  R,  All  other  attributes  are  non-prime. 

The  keys  of  a relation  are  normally  underlined. 

If  there  are  three  attribute  sets  X,Y,Z  such  that  X ^Y,  X-»Z  and  Y->Z, 
then  Z is  said  to  be  partially  dependent  on  Y.  If  Y-^Z  and  there  exists  no 
X cY  such  that  X~^Z,  then  Z is  fully  dependent  on  Y. 

An  attribute  A is  transitively  dependent  upon  a set  of  attributes  X 
if  there  exists  a set  of  attributes  Y such  that  X~^Y,  Y^^X  and  Y-^A,  where 
Aj(X,  A |(  Y and  A,X,Y  are  taken  from  the  attributes  of  a single  relation  R. 

A relation  R is  in  first  normal  form  (or  INF)  if  every  attribute  A 
assumes  values  from  only  single-valued  domains.  That  is,  relation-valued 
domains  are  excluded  from  relations.  The  advantage  of  a INF  relation  over  a 
general  relation  is  that  the  database  can  be  viewed  as  a collection  of  simple 
tables  (instead  of  'tables'  of  tables)  so  that  a small  class  of  operations 
is  applicable  to  all  relations  in  the  database. 

The  second  and  third  normal  forms  are  Introduced  to  eliminate  certain 
update  anomalies.  Consider  the  functional  dependencies  of  a database  as 
shown  in  Figure  A.l.  If  we  create  a INF  relation  R (PRODUCT.  SUPPLIER.  QTY, 
CITY,  STATE) , then  insertion-deletion  anomalies  can  occur  due  to  the  partial 
dependency  of  CITY  on  the  key  {PRODUCT, SUPPLIER }.  Thus,  when  a supplier  is 
supplying  no  parts,  his  city  of  origin  cannot  be  recorded  in  the  database. 

A relation  is  in  2NF  if  it  is  in  INF  and  each  of  its  non-prime  attributes 
is  fully  dependent  upon  every  key.  Thus,  the  database  of  Figure  A.l  can 
be  represented  by  two  2NF  relations: 

R1 (PRODUCT.  SUPPLIER.  QTY) 
and  R2 (SUPPLIER.  CITY,  STATE) 

There  is  still  another  problem  with  R2,  even  though  it  is  in  2NF.  Since 
STATE  is  transitively  dependent  on  SUPPLIER  via  CITY,  any  time  the  last 
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tuple  for  a city  is  removed  from  R2,  the  corresponding  city-state  association 
is  simultaneously  destroyed. 

A relation  is  in  3NF  if  none  of  its  non-prime  attributes  are  transitively 
dependent  upon  any  key.  The  2NF  relation  R2  can  be  split  up  into  two  3NF 
relations : 

R3  (SUPPLIER.  CITY) 
and  R4(CITY,  STATE) 

Further  normalization  to  a revised  version  of  3NF  relations  (called  the 
Boyce-Codd  Normal  Form)  and  to  4NF  relations  (involving  multiple  dependencies) 
may  be  found  in  [4,12], 


PRODUCT,  SUPPLIER  QTY 
SUPPLIER ->•  CITY 
CITY -►'STATE 

Figure  A.l.  Functional  dependencies  in  a Product-Supplier  database. 


